Machine Teaching for Personalized Education, Security, Interactive Machine Learning. Jerry Zhu

Similar documents
Optimal Teaching for Online Perceptrons

Machine Teaching and its Applications

How can a Machine Learn: Passive, Active, and Teaching

Machine Learning And Applications: Supervised Learning-SVM

Discriminative Models

Generalization, Overfitting, and Model Selection

Discriminative Models

6.036 midterm review. Wednesday, March 18, 15

ECS171: Machine Learning

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

Generalization and Overfitting

Statistical Data Mining and Machine Learning Hilary Term 2016

Toward Adversarial Learning as Control

Max Margin-Classifier

Midterm Exam, Spring 2005

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Statistical Approaches to Learning and Discovery. Week 4: Decision Theory and Risk Minimization. February 3, 2003

Advanced Introduction to Machine Learning CMU-10715

Machine Learning (CS 567) Lecture 2

Support Vector Machines

Classification. The goal: map from input X to a label Y. Y has a discrete set of possible values. We focused on binary Y (values 0 or 1).

Bits of Machine Learning Part 1: Supervised Learning

SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels

Machine Learning 4771

Lecture 3 Classification, Logistic Regression

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

ML (cont.): SUPPORT VECTOR MACHINES

Lecture 18: Multiclass Support Vector Machines

Linear Models in Machine Learning

10-701/ Machine Learning - Midterm Exam, Fall 2010

Linear Classifiers. Michael Collins. January 18, 2012

SVRG++ with Non-uniform Sampling

Linear Classifiers IV

ABC-LogitBoost for Multi-Class Classification

CS 446 Machine Learning Fall 2016 Nov 01, Bayesian Learning

CIS 520: Machine Learning Oct 09, Kernel Methods

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Linear Classifiers. Blaine Nelson, Tobias Scheffer

Lecture 2 Machine Learning Review

PAC-learning, VC Dimension and Margin-based Bounds

Clustering. Professor Ameet Talwalkar. CS260 Machine Learning Algorithms. March 8.

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Master 2 MathBigData. November 3. CMAP - Ecole Polytechnique

Support Vector Machines for Classification and Regression

Kernel Methods & Support Vector Machines

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Recommendation. Tobias Scheffer

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers

Support Vector Machines

CS229 Supplemental Lecture notes

Learning Ensembles. 293S T. Yang. UCSB, 2017.

Support Vector Machines for Classification and Regression. 1 Linearly Separable Data: Hard Margin SVMs

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014

Does Unlabeled Data Help?

Statistical Properties of Large Margin Classifiers

Reward-modulated inference

CSCI-567: Machine Learning (Spring 2019)

CPSC 540: Machine Learning

Bayesian Learning (II)

Blind Attacks on Machine Learners

Logistic Regression Trained with Different Loss Functions. Discussion

Machine Learning for Biomedical Engineering. Enrico Grisan

Regression and Classification with Linear Models. CMPSCI 383, Nov 15, 2011.

STAT 535 Lecture 5, November 2018. Brief overview of Model Selection and Regularization. © Marina Meilă

ECS289: Scalable Machine Learning

Importance Sampling for Minibatches

Foundations of Machine Learning

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Kernel Methods and Support Vector Machines

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

STA141C: Big Data & High Performance Statistical Computing

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer

Computational Learning Theory. CS534 - Machine Learning

PAC-learning, VC Dimension and Margin-based Bounds

Qualifier: CS 6375 Machine Learning Spring 2015

Linear & nonlinear classifiers

Click Prediction and Preference Ranking of RSS Feeds

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning

Introduction to Support Vector Machines

Lecture 8. Instructor: Haipeng Luo

CSE 546 Final Exam, Autumn 2013

DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016. Sandro Schönborn, University of Basel

Support Vector Machines

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University

CSE 546 Midterm Exam, Fall 2014

Fast Logistic Regression for Text Categorization with Variable-Length N-grams

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 9

Support Vector Machine

Ensembles of Classifiers.

A Magic CV Theory for Large-Margin Classifiers

STATE GENERALIZATION WITH SUPPORT VECTOR MACHINES IN REINFORCEMENT LEARNING. Ryo Goto, Toshihiro Matsui and Hiroshi Matsuo

Learning with multiple models. Boosting.

STA141C: Big Data & High Performance Statistical Computing

ECS289: Scalable Machine Learning

Lecture 3: Statistical Decision Theory (Part II)

Support Vector Machine. Industrial AI Lab.

PAC Learning Introduction to Machine Learning. Matt Gormley Lecture 14 March 5, 2018

Generalization, Overfitting, and Model Selection

Statistical aspects of prediction models with high-dimensional data

Adversarial Label Flips Attack on Support Vector Machines

Qualifying Exam in Machine Learning

Transcription:

Machine Teaching for Personalized Education, Security, Interactive Machine Learning. Jerry Zhu. NIPS 2015 Workshop on Machine Learning from and for Adaptive User Technologies.

Supervised Learning Review. Training set $D = (x_1, y_1), \ldots, (x_n, y_n)$. Learning algorithm $A$: $A(D) = \{\theta\}$. Example: regularized empirical risk minimization, $A(D) = \mathrm{argmin}_\theta \sum_{i=1}^n \ell(\theta, x_i, y_i) + \lambda \Omega(\theta)$. Example: version space learner, $A(D) = \{\theta \text{ consistent with } D\}$.
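
To make the review concrete, here is a minimal sketch (my own illustration, not from the talk) of a regularized-ERM learner with squared loss and an $\ell_2$ regularizer, for which $A(D)$ has the closed-form ridge solution:

```python
import numpy as np

def A(D, lam=0.1):
    """Regularized ERM: argmin_theta sum_i (theta^T x_i - y_i)^2 + lam * ||theta||^2.
    For squared loss + l2 regularizer, the argmin is the ridge solution."""
    X = np.array([x for x, _ in D], dtype=float)
    y = np.array([y for _, y in D], dtype=float)
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy training set D = (x_1, y_1), ..., (x_n, y_n)
D = [(np.array([1.0, 0.0]), 1.0), (np.array([0.0, 1.0]), -1.0)]
print(A(D))  # the learned theta
```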

Machine Teaching = Inverse Machine Learning. Given: learning algorithm $A$, target model $\theta^*$. Find: the smallest $D$ such that $A(D) = \{\theta^*\}$. [Diagram: the forward map takes $D$ in the space of training sets $\mathbb{D}$ to $A(D)$ in the model space $\Theta$; teaching inverts it, seeking $D \in A^{-1}(\theta^*)$.]

Machine Teaching Example. Teach a 1D threshold classifier. Given: $A$ = SVM, $\theta^* = 0.6$. What is the smallest $D$?

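A sketch of the answer (using scikit-learn as a stand-in learner; the library choice is mine): for a hard-margin linear SVM, two items placed symmetrically around 0.6 pin the decision boundary at their midpoint, so the smallest $D$ has two items.

```python
import numpy as np
from sklearn.svm import SVC

# Two teaching items straddling the target threshold theta* = 0.6
X = np.array([[0.4], [0.8]])
y = np.array([-1, 1])

svm = SVC(kernel="linear", C=1e6).fit(X, y)  # large C approximates a hard margin

# Decision boundary: the x where w*x + b = 0
print(-svm.intercept_[0] / svm.coef_[0, 0])  # ~0.6, the midpoint of D
```

Any symmetric pair $(0.6 - \epsilon, 0.6 + \epsilon)$ works; the pair fixes both the boundary and the margin, which is why two items suffice for this learner.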

Machine Teaching as Communication. sender = teacher, message = $\theta^*$, decoder = $A$, codeword = $D$.

Machine Teaching Stronger Than Active Learning. Sample complexity to achieve $\epsilon$ error: passive learning $n = O(1/\epsilon)$; active learning $n = O(\log(1/\epsilon))$, needs binary search; machine teaching $n = 2$: teaching dimension [Goldman + Kearns 1995], the teacher knows $\theta^*$. [Figure: error decays as $O(1/n)$ for passive learning and $O(1/2^n)$ for active learning; passive learning "waits", active learning "explores", teaching "guides".]
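
A quick simulation (hypothetical setup, my own) of the three regimes for a 1D threshold at $\theta^* = 0.6$: the passive learner takes the midpoint of the tightest straddling pair of random points, active learning runs binary search, and the teacher simply supplies two points next to $\theta^*$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star, eps = 0.6, 1e-3

# Passive: n i.i.d. points; estimate = midpoint of the tightest straddling pair
def passive_error(n):
    x = rng.uniform(0, 1, n)
    lo = x[x < theta_star].max(initial=0.0)
    hi = x[x >= theta_star].min(initial=1.0)
    return abs((lo + hi) / 2 - theta_star)

print(np.mean([passive_error(100) for _ in range(2000)]))  # on the order of 1/n

# Active: binary search localizes theta* with ~log2(1/eps) label queries
lo, hi, queries = 0.0, 1.0, 0
while hi - lo > eps:
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if mid < theta_star else (lo, mid)  # query the label at mid
    queries += 1
print(queries)  # 10 for eps = 1e-3

# Teaching: n = 2 items at theta* -/+ eps recover theta* as their midpoint
print(abs(((theta_star - eps) + (theta_star + eps)) / 2 - theta_star))  # ~0, up to float round-off
```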

Why Machine Teaching? Why bother if we already know $\theta^*$? 1. education 2. computer security 3. interactive machine learning

Education. The teacher answers three questions: 1. My student's cognitive model $A$ is a (a) SVM (b) logistic regression (c) neural net... 2. Student success is defined by (a) learning a target model $\theta^*$ (b) excelling on a test set. 3. My teaching effort is defined by (a) training set size (b) training item cost... Machine teaching finds the optimal, personalized lesson $D$ for the student.

Education Example [Patil et al. 2014]. Human categorization task: 1D threshold $\theta^* = 0.5$; $A$: kernel density estimator. [Figure: the optimal $D$ places items labeled $y = -1$ and $y = 1$ on $[0, 1]$ around $\theta^* = 0.5$.] Teaching humans with $D$:

human trained on  |  human test accuracy
optimal $D$       |  72.5%
random items      |  69.8%

(difference statistically significant)
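
A sketch of this kind of learner (a Gaussian KDE per class is my stand-in for the exact estimator in [Patil et al. 2014]): the model classifies by comparing class-conditional densities, so a well-chosen $D$ places the density crossover at $\theta^* = 0.5$.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical teaching set: two items per class, mirror-symmetric around 0.5
neg = np.array([0.30, 0.45])   # items labeled y = -1
pos = np.array([0.55, 0.70])   # items labeled y = +1
f_neg, f_pos = gaussian_kde(neg), gaussian_kde(pos)

# The KDE learner predicts +1 wherever the positive-class density dominates;
# its decision threshold is the density crossover.
grid = np.linspace(0, 1, 1001)
diff = f_pos(grid) - f_neg(grid)
crossing = grid[np.where(np.diff(np.sign(diff)) != 0)[0][0]]
print(crossing)  # ~0.5, by the symmetry of the teaching set
```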

Security: Data Poisoning Attack [Alfeld et al. 2016], [Mei + Zhu 2015]. Given: learner $A$, attack target $\theta^*$, clean training data $D_0$. Find: the minimum poison $\delta$ such that $A(D_0 + \delta) = \{\theta^*\}$.

Security: Lake Mendota Data. [Figure: ice days per year (ICE DAYS, roughly 40-140) vs. YEAR, 1900-2000, for Lake Mendota, Wisconsin.]

Security: Optimal Attacks on Ordinary Least Squares. $\min_{\delta, \hat{\beta}} \|\delta\|_p$ s.t. $\hat{\beta}_1 = 0$, $\hat{\beta} = \mathrm{argmin}_\beta \|(y + \delta) - X\beta\|^2$. [Figures: the ice-days series with the original fit ($\hat{\beta}_1 = -0.1$) and the attacked fits ($\hat{\beta}_1 = 0$); the left panel minimizes $\|\delta\|_2^2$ (2-norm attack), the right panel minimizes $\|\delta\|_1$ (1-norm attack).]
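
For the 2-norm case the inner OLS problem makes the attack closed-form. A sketch on toy data standing in for the Lake Mendota series (numbers are illustrative, not the real data): the fitted slope is a linear function $a^\top y$ of the targets, so the minimum-norm poison that zeroes it is an orthogonal projection.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(0.0, 101.0)                       # years since 1900 (toy stand-in)
y = 110.0 - 0.1 * t + rng.normal(0, 8, t.size)  # declining trend plus fake noise

X = np.column_stack([np.ones_like(t), t])       # design matrix: [intercept, slope]
a = np.linalg.solve(X.T @ X, X.T)[1]            # row vector with fitted slope = a @ targets

# min ||delta||_2  s.t.  a @ (y + delta) = 0   =>   delta = -a (a @ y) / (a @ a)
delta = -a * (a @ y) / (a @ a)

print(np.linalg.solve(X.T @ X, X.T @ y)[1])             # original slope, ~ -0.1
print(np.linalg.solve(X.T @ X, X.T @ (y + delta))[1])   # attacked slope, ~ 0
```

The 1-norm attack enforces the same constraint but concentrates the perturbation on a few points, consistent with the tall spikes in the 1-norm panel of the original slide.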

Security: Optimal Attack on Latent Dirichlet Allocation [Mei + Zhu 2015]

Optimization for Machine Teaching. Bilevel optimization: $\min_D |D|$ s.t. $\theta^* = \mathrm{argmin}_{\theta \in \Theta} \frac{1}{|D|} \sum_{(x_i, y_i) \in D} \ell(x_i, y_i, \theta) + \Omega(\theta)$. In general: $\min_D \mathrm{TeachingLoss}(A(D), \theta^*) + \mathrm{TeachingEffort}(D)$. Sometimes a closed-form solution [Alfeld et al. 2016] [Zhu 2013]; often nonconvex optimization [Bo Zhang, this workshop].
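
When TeachingEffort is just $|D|$ and the candidate items come from a finite pool, the outer problem can be solved by brute force; a sketch with a hypothetical learner (a 1D max-margin thresholder) standing in for $A$:

```python
from itertools import combinations

def teach(A, theta_star, pool, tol=1e-9):
    """Outer problem: smallest D (from a finite pool) with A(D) = theta_star.
    Each call A(D) is the inner problem, the learner's own training optimization."""
    for size in range(1, len(pool) + 1):
        for D in combinations(pool, size):
            if abs(A(list(D)) - theta_star) <= tol:
                return list(D)
    return None

def A(D):
    """Hypothetical learner: threshold at the midpoint of the class gap."""
    neg = [x for x, y in D if y == -1]
    pos = [x for x, y in D if y == +1]
    if not neg or not pos:
        return float("nan")  # undefined without both classes; never matches
    return (max(neg) + min(pos)) / 2

pool = [(x / 10, -1 if x / 10 < 0.6 else 1) for x in range(11)]
print(teach(A, 0.6, pool))  # [(0.2, -1), (1.0, 1)]: two items suffice
```

Real instances replace the enumeration with continuous optimization over $D$, which is where the nonconvexity comes from.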

Theoretical Question: Teaching Dimension. Given: $A$, $\theta^*$. Find: the smallest $D$ such that $A(D) = \{\theta^*\}$. Teaching dimension $TD = |D|$. $A$ = version space learner: $TD = ?$ [Goldman + Kearns 1995]. $A$ = SVM: $TD = 2$. A learner-specific teaching dimension.

Teaching Dimension Example: Teaching an SVM Learner. $A(D) = \mathrm{argmin}_{\theta \in \mathbb{R}^2} \sum_{i=1}^n \max(1 - y_i x_i^\top \theta, 0) + \frac{1}{2}\|\theta\|^2$. Target model: $\theta^* = (\frac{1}{2}, \frac{\sqrt{3}}{2})$. One training item is necessary and sufficient, $TD(\theta^*, A) = 1$: $x_1 = (\frac{1}{2}, \frac{\sqrt{3}}{2})$, $y_1 = 1$.
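
A numeric check of the claim (my sketch; the optimizer choice is arbitrary, Nelder-Mead just tolerates the non-smooth hinge): trained on the single item $x_1 = \theta^*$, $y_1 = 1$, the learner lands exactly on $\theta^*$.

```python
import numpy as np
from scipy.optimize import minimize

theta_star = np.array([0.5, np.sqrt(3) / 2])   # unit-norm target (1/2, sqrt(3)/2)
D = [(theta_star.copy(), 1)]                   # the single teaching item

def objective(theta):
    hinge = sum(max(1.0 - y * (x @ theta), 0.0) for x, y in D)
    return hinge + 0.5 * (theta @ theta)

res = minimize(objective, x0=np.zeros(2), method="Nelder-Mead",
               options={"xatol": 1e-10, "fatol": 1e-10})
print(res.x, theta_star)  # both ~ [0.5, 0.8660]
```

Because $\|x_1\| = 1$, the hinge becomes tight exactly at $\theta = x_1$, where its pull balances the regularizer; a single hinge item cannot pull this learner past norm $1/\sqrt{\lambda}$, which is why the unit-norm target is teachable with one item.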

The Teaching Dimension of Linear Learners [Liu & Zhu 2015]. $A(D) = \mathrm{argmin}_{\theta \in \mathbb{R}^d} \sum_{i=1}^n \ell(\theta^\top x_i, y_i) + \frac{\lambda}{2}\|\theta\|_2^2$

learner   | goal: parameter (homogeneous)                        | boundary | parameter (inhomogeneous, $\theta^* = [w^*; b^*]$)  | boundary
ridge     | $1$                                                  | -        | $2$                                                  | -
SVM       | $\lceil \lambda\|\theta^*\|_2^2 \rceil$              | $1$      | $2\lceil \lambda\|w^*\|_2^2/2 \rceil$                | $2$
logistic  | $\lceil \lambda\|\theta^*\|_2^2/\tau_{\max} \rceil$  | $1$      | $2\lceil \lambda\|w^*\|_2^2/(2\tau_{\max}) \rceil$   | $2$

Teaching dimension (independent of $d$), as distinct from the VC dimension ($d + 1$).

Interactive Machine Learning. You train a classifier. Only use (item, label) pairs $(x_1, y_1), \ldots, (x_n, y_n)$. Cost = $n$ to reach small $\epsilon$ error.

Speed of Light [Goldman + Kearns 1995], [Angluin 2004], [Cakmak + Thomaz 2011], [Zhu et al., in preparation]. Fundamental cost of interactive machine learning: $n \geq$ Teaching Dimension. An optimal teacher achieves $n =$ Teaching Dimension. Can be much faster than active learning (recall $2$ vs. $\log\frac{1}{\epsilon}$). Must allow teacher-initiated items (unlike active learning). But some humans are bad teachers...

Some Humans are Suboptimal Teachers [Khan, Zhu, Mutlu 2011]. Challenge for the computer: make the human behave more like the optimal teacher.

Summary. Machine teaching. Theory: teaching dimension. Algorithm: bilevel optimization. Applications: education, security, interactive machine learning. http://pages.cs.wisc.edu/~jerryzhu/machineteaching/