Classifier Adaptation at Prediction Time


1 Classifier Adaptation at Prediction Time, or: How Bayes' rule might help you to reduce your error rate by half. Christoph Lampert. Yandex, Moscow, September 8th, 2016

2 IST Austria (Institute of Science and Technology Austria), Vienna. New public research institute. Natural and formal sciences: Computer Science, Mathematics, Biology, Neuroscience, Physics. PhD-granting, no undergrad. Basic, curiosity-driven research with a focus on interdisciplinary work. Open positions in all fields: IST Austria Graduate School, Postdoc Fellowships, Tenure-Track Assistant Professors, Full Professors, Internships, Research Visits, Sabbaticals, ... More information: ist.ac.at or chl@ist.ac.at

3 Long term goal Automatic systems that can analyze and interpret data Image Understanding Three men sit at a table in a pub, drinking beer. One of them talks while the other two listen. Image: British Broadcasting Corporation (BBC)

4 State of the art: analyze individual aspects of visual data. Scene Classification: indoors, in a pub. Action Classification: drinking, talking. Object Recognition: three persons, one table, three glasses

5 Crucial Step: Object Recognition: which objects (person, bottle, cake, truck, car, table, tiger, zebra, ...) are present in the image? Image: Tony Alter, under Creative Commons

6 Object recognition has gone large scale: big data Image: Forsyth, Efros, Fei-Fei, Torralba, Zisserman, "The Promise and Perils of Benchmark Datasets and Challenges", 2011.

7 Object recognition has gone complex: deep networks Image left: Krizhevsky, Sutskever, Hinton, "ImageNet classification with deep convolutional neural networks", NIPS 2012. Image right: adapted from He, Zhang, Ren, Sun, "Deep Residual Learning for Image Recognition", arXiv:1512.03385

8 Object recognition has gone expensive: HPC/GPU clusters Image: "The CSIRO GPU cluster at the data centre" by CSIRO. Licensed under CC BY 3.0 via Wikimedia Commons

9 Don't train object classifiers yourself. Order them pre-trained. Image: faked

10 Research Challenge Image Understanding with Pretrained Classifiers

11 Academic setting: independent, identically distributed data at training and prediction time. Image: ImageNet dataset

12 Vendor Customer 1 Domain Shift Image: ImageNet dataset Image: "Supermarkt". Licensed under GFDL via Wikimedia Commons

13 Vendor Customer 2 Domain Shift, Dependent Samples Image: ImageNet dataset Image: "Baggage Claim at CPH" by Duhhitsminerva. Licensed under CC BY 3.0 via Wikimedia Commons

14 Vendor Customer 3 Domain Shift, Dependent Samples, Non-Stationary Distribution Image: ImageNet dataset Image: Christoph Lampert 2015

15 Dependent Samples Academic setting: training and test data are sampled i.i.d. images are independent, identically distributed Real-life prediction tasks: very much non-i.i.d. surveillance: temporal dependences between images photo collections: specific selection of themes

16 Dependent Samples Academic setting: training and test data are sampled i.i.d. images are independent, identically distributed Real-life prediction tasks: very much non-i.i.d. surveillance: temporal dependences between images photo collections: specific selection of themes We argue: This is a blessing, not a nuisance! some shop Images: ImageNet dataset

17 Dependent Samples Academic setting: training and test data are sampled i.i.d. images are independent, identically distributed Real-life prediction tasks: very much non-i.i.d. surveillance: temporal dependences between images photo collections: specific selection of themes We argue: This is a blessing, not a nuisance! earlier images act as context bakery Images: ImageNet dataset

18 Domain Shift. Notation: x ∈ X images, y ∈ Y = {1,...,K} class labels; P(x,y) data distribution at training time (vendor); Q(x,y) data distribution at prediction time (customer). Domain shift: P(x,y) ≠ Q(x,y)

19 Domain Shift. Notation: x ∈ X images, y ∈ Y = {1,...,K} class labels; P(x,y) data distribution at training time (vendor); Q(x,y) data distribution at prediction time (customer). Domain shift: P(x,y) ≠ Q(x,y). Three cases: P(y|x) = Q(y|x), but P(x) ≠ Q(x): covariate shift; P(x|y) ≠ Q(x|y): appearance shift; P(x|y) = Q(x|y), but P(y) ≠ Q(y): class prior shift

21 Domain Shift: Appearance shift is mitigated by invariant features. Image: [Donahue et al., ICML 2014]

23 Domain Shift. Training time: P(y) typically balanced, P(y) ≈ 1/K; e.g. in ILSVRC2014, as many volcanos as cucumbers. Prediction time: Q(y) highly imbalanced, low entropy (easy to learn!): supermarket: Q(y) lots of fruit, most likely no volcanos; airport: Q(y) lots of people and baggage, also no volcanos; vacation: Q(y) occasional volcanos, but more beaches. Class prior shift is real, but also potentially beneficial.

24 Classifier Adaptation at Prediction Time Amélie Royer ENS Rennes/IST Austria [A. Royer, CHL, "Classifier Adaptation at Prediction Time", CVPR 2015]

25 Class Prior Adaptation. Training time: optimal multi-class classifier f: X → Y, f(x) = argmax_{y∈Y} f_y(x) for f_y(x) ≈ P(y|x). Prediction time: optimal multi-class classifier g: X → Y, g(x) = argmax_{y∈Y} g_y(x) for g_y(x) ≈ Q(y|x). For P(x|y) = Q(x|y), but P(y) ≠ Q(y): Q(y|x) = P(y|x)P(x)Q(y) / (P(y)Q(x)) ∝ f_y(x) Q(y)/P(y). Optimal classifier: g(x) = argmax_{y∈Y} f_y(x) Q(y)/P(y).

26 Class Prior Adaptation. Probabilistic classifier f(x) = argmax_y f_y(x), with f_y: X → R; class proportions at training time ρ ∈ R^K, i.e. ρ_y = P(y); class proportions at prediction time π ∈ R^K, i.e. π_y = Q(y). Definition: The class-prior adaptation of f from ρ to π is g(x) = argmax_{y∈Y} g_y(x) for g_y(x) = f_y(x) π_y / ρ_y. Note: no retraining, only adjust output scores. [Saerens et al., 2002] Lemma: g is Bayes-optimal for Q(x,y)-distributed data, if P(x,y) differs from Q(x,y) only in the class proportions and f_y(x) = P(y|x).
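
To make the adaptation rule concrete, here is a minimal sketch (not the authors' implementation; all names are hypothetical) of rescaling a classifier's per-class scores by π_y/ρ_y:

```python
import numpy as np

def adapt_scores(f_scores, rho, pi):
    """Class-prior adaptation of probabilistic scores (sketch).

    f_scores: array of shape (K,), approximating P(y|x) under training priors rho.
    rho:      training-time class proportions, shape (K,).
    pi:       (estimated) prediction-time class proportions, shape (K,).
    Returns adapted scores g_y(x) proportional to f_y(x) * pi_y / rho_y.
    """
    g = f_scores * pi / rho
    return g / g.sum()          # normalization does not change the argmax

# toy example: balanced 3-class training prior, skewed prediction-time prior
f = np.array([0.45, 0.40, 0.15])          # classifier output for one image
rho = np.array([1/3, 1/3, 1/3])
pi = np.array([0.1, 0.8, 0.1])
print(adapt_scores(f, rho, pi).argmax())  # adapted prediction: class 1, not class 0
```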

28 Class-Prior Adaptation during Sequential Prediction. Problem: In the vendor/customer scenario, the class proportions at prediction time, π, are unknown. Solution: learn the proportions on-the-fly at prediction time

29 Class-Prior Adaptation during Sequential Prediction. Problem: In the vendor/customer scenario, the class proportions at prediction time, π, are unknown. Solution: learn the proportions on-the-fly at prediction time. Sequential prediction scenario: images to be classified arrive sequentially, x_1, x_2, ...; goal: for each x_t make prediction g(x_t). Three possible feedback scenarios: online: after prediction the correct label, y_t, is revealed (e.g. supermarket cash register); bandit: after prediction it is revealed if a mistake was made (e.g. augmented reality glasses); unsupervised: no feedback about correct labels (e.g. surveillance)

30-49 Example (no adaptation): the base classifier is applied to a stream of images, one step revealed per slide. For the first image the scores are f_cat(x_t) = 0.8, f_dog(x_t) = 0.1, f_truck(x_t) = 0.1, so the prediction is f(x_t) = cat. Over the five images shown, the predictions are cat, truck, dog, cat, truck; the online feedback (true labels) is cat, dog, dog, cat, dog; the bandit feedback is correct, wrong, correct, correct, wrong; in the no-feedback setting nothing is revealed. Without adaptation the classifier's scores and predictions never change in response to the feedback.

50 Estimating Class Priors. Examples, x_1, x_2, ..., with labels, y_1, y_2, ... Task: estimate class proportions, (π_y)_{y∈Y}. Smoothed Maximum Likelihood (aka Bayesian estimator with Dirichlet prior): π_y^(t) = (n_t(y) + α) / (t + Kα) for α > 0 (e.g. α = 1/2), where n_t(y) = Σ_{τ=1}^{t} ⟦y_τ = y⟧ counts how often each label occurred so far. (Preferable to the ML estimator, π_y^(t) = n_t(y)/t, which assigns 0 probability to unseen classes.)
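
A minimal sketch of the smoothed estimator (illustrative only, assuming the label counts are kept in a NumPy array; the function name is my own):

```python
import numpy as np

def smoothed_priors(counts, t, alpha=0.5):
    """Smoothed ML / Dirichlet-prior estimate of class proportions.

    counts: array of shape (K,) with n_t(y), how often each label has been
            observed among the first t examples.
    t:      number of examples seen so far.
    alpha:  smoothing pseudo-count (alpha > 0); alpha = 0 gives plain ML.
    """
    counts = np.asarray(counts, dtype=float)
    K = counts.shape[0]
    return (counts + alpha) / (t + K * alpha)

# example with K = 3 classes after observing the labels cat, dog, dog, cat
print(smoothed_priors([2, 2, 0], t=4))   # the unseen class keeps nonzero probability
```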

51 Online Feedback. π_y^(t) = (n_t(y) + α) / (t + Kα) for n_t(y) = Σ_{τ=1}^{t} ⟦y_τ = y⟧. After prediction g(x_t), the correct label y_t is revealed; compute n_t(y) incrementally: n_t(y) = n_{t-1}(y) + ⟦y_t = y⟧.

52 Online Feedback. π_y^(t) = (n_t(y) + α) / (t + Kα) for n_t(y) = Σ_{τ=1}^{t} ⟦y_τ = y⟧. After prediction g(x_t), the correct label y_t is revealed; compute n_t(y) incrementally: n_t(y) = n_{t-1}(y) + ⟦y_t = y⟧. Law of large numbers: π^(t) converges to the true class distribution. This holds even for dependent samples (under weak conditions).
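
Combining the estimator with the score adaptation, a sketch of the online-feedback loop could look as follows (hypothetical helper names; in the actual experiments the score function is a pre-trained CNN or SVM):

```python
import numpy as np

def online_adaptation(score_fn, stream, K, rho, alpha=0.5):
    """Sequential prediction with class-prior adaptation and online feedback.

    score_fn: callable returning per-class scores f_y(x), approximating P(y|x).
    stream:   iterable of (x_t, y_t) pairs; y_t is only used after predicting.
    K:        number of classes; rho: training-time class proportions (shape (K,)).
    """
    counts = np.zeros(K)
    predictions = []
    for t, (x, y_true) in enumerate(stream, start=1):
        pi = (counts + alpha) / ((t - 1) + K * alpha)   # prior estimate from t-1 labels
        g = score_fn(x) * pi / rho                      # adapted scores g_y(x)
        predictions.append(int(np.argmax(g)))
        counts[y_true] += 1                             # online feedback update
    return predictions
```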

53-76 Example (online feedback): the same image stream, now with class-prior adaptation. Initially π_(cat,dog,truck) = (1/3, 1/3, 1/3), so the adapted scores equal the base scores (g_cat(x_t) = 0.8, g_dog(x_t) = 0.1, g_truck(x_t) = 0.1 for the first image) and the first prediction is cat. After each revealed label the counts and the prior estimate are updated: feedback cat gives n(cat) += 1 and π = (2/4, 1/4, 1/4); feedback dog gives n(dog) += 1 and π = (2/5, 2/5, 1/5); feedback dog gives n(dog) += 1 and π = (2/6, 3/6, 1/6); feedback cat gives n(cat) += 1 and π = (3/7, 3/7, 1/7). The predictions along the way are cat, truck, dog, cat, dog(!): for the fifth image the adapted classifier predicts dog instead of truck, which matches the revealed label dog.

77 Bandit Feedback. π_y^(t) = (n_t(y) + α) / (t + Kα) for n_t(y) = Σ_{τ=1}^{t} δ_τ(y). After prediction g(x_t), it is only revealed whether the prediction was correct; estimate n_t(y) incrementally: n_t(y) = n_{t-1}(y) + δ_t(y), where if the decision was correct: δ_t(y) = ⟦y_t = y⟧, and if the decision was incorrect: δ_t(y) = 0 for y = g(x_t) and 1/(K-1) otherwise. (Also possible: δ_t(y) ∝ Q^(t)(y|x_t) for y ≠ g(x_t).)
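
A sketch of the bandit-feedback count update under these definitions (the uniform 1/(K-1) variant; names are illustrative):

```python
import numpy as np

def bandit_update(counts, predicted, correct, K):
    """Update the label counts n_t(y) from bandit feedback.

    counts:    array of shape (K,) holding the running counts.
    predicted: class index g(x_t) that was predicted.
    correct:   True if the prediction was revealed to be correct.
    """
    delta = np.zeros(K)
    if correct:
        delta[predicted] = 1.0            # the true label is known: it was predicted
    else:
        delta[:] = 1.0 / (K - 1)          # spread the mass over all other classes
        delta[predicted] = 0.0
    return counts + delta
```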

78 Example (bandit feedback): the predictions g(x_t) are cat, truck, dog, cat, dog(!); the feedback is correct, wrong, correct, correct, correct. Count updates: n(cat) += 1; then (wrong prediction truck) n(cat) += 0.5 and n(dog) += 0.5; then n(dog) += 1; then n(cat) += 1. The estimate π_(cat,dog,truck) evolves as (1/3, 1/3, 1/3), (2/4, 1/4, 1/4), (2.5/5, 1.5/5, 1/5), (2.5/6, 2.5/6, 1/6), (3.5/7, 2.5/7, 1/7).

79 Unsupervised (No Feedback). π_y^(t) = (n_t(y) + α) / (t + Kα) for n_t(y) = Σ_{τ=1}^{t} δ_τ(y). No information whether prediction g(x_t) was correct or not; estimate n_t(y) by trusting our own predictions (self-training): n_t(y) = n_{t-1}(y) + δ_t(y) with δ_t(y) = E_{ȳ ~ Q^(t)(ȳ|x_t)} ⟦y = ȳ⟧ = g_y^(t)(x_t) / Σ_ȳ g_ȳ^(t)(x_t). No guarantee, but can be expected to work for decent base classifiers.
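
The unsupervised update adds the classifier's own (adapted and normalized) scores as soft counts; a minimal sketch, assuming the adapted scores are non-negative:

```python
import numpy as np

def unsupervised_update(counts, adapted_scores):
    """Self-training count update without any feedback.

    counts:         array of shape (K,), running soft counts n_t(y).
    adapted_scores: unnormalized adapted scores g_y(x_t) for the current image.
    Adds the expected label indicator, i.e. the normalized adapted scores.
    """
    q = np.asarray(adapted_scores, dtype=float)
    delta = q / q.sum()          # delta_t(y) = g_y(x_t) / sum over ybar of g_ybar(x_t)
    return counts + delta
```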

80 Example (no feedback): the predictions g(x_t) are cat, truck, dog, cat, dog(!); no labels are revealed. Each image adds its normalized adapted scores as soft counts: n(cat) += 0.8, n(dog) += 0.1, n(truck) += 0.1; then n(cat) += 0.15, n(dog) += 0.38, n(truck) += 0.47; then n(cat) += 0.25, n(dog) += 0.65, n(truck) += 0.10; then n(cat) += 0.53, n(dog) += 0.31, n(truck) += ... The estimate π_(cat,dog,truck) starts at (1/3, 1/3, 1/3) and is updated with these soft counts after every image; for the fifth image the adapted classifier again predicts dog instead of truck.

81 Extension: Non-Stationary Data Distribution. What if the data distribution changes, e.g. a mobile camera? Sliding window estimate: adapt only to the recent past (e.g. L = 100): π_y^(t) = (n_t(y) + α) / (L + Kα), with n_t(y) = Σ_{τ=t-L+1}^{t} δ_τ(y).
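
The windowed variant only needs the last L per-step increments δ_τ; a sketch using a fixed-length buffer (class and parameter names are my own):

```python
from collections import deque
import numpy as np

class WindowedPriorEstimator:
    """Sliding-window class-prior estimate over the last L increments."""

    def __init__(self, K, L=100, alpha=0.5):
        self.K, self.L, self.alpha = K, L, alpha
        self.window = deque(maxlen=L)     # keeps only the last L delta vectors

    def update(self, delta):
        """delta: length-K increment (one-hot label, bandit spread, or soft scores)."""
        self.window.append(np.asarray(delta, dtype=float))

    def priors(self):
        n = sum(self.window, np.zeros(self.K))   # n_t(y) over the last L steps
        # denominator L + K*alpha, following the slide's formula
        return (n + self.alpha) / (self.L + self.K * self.alpha)
```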

82 Realistic Image Sequences How to benchmark such an adaptive classification system? We need realistic image sequences: non-uniform class distribution dependent samples non-stationary distribution

83 Realistic Image Sequences. How to benchmark such an adaptive classification system? We need realistic image sequences: non-uniform class distribution, dependent samples, non-stationary distribution. Proposal: three methods based on an existing i.i.d. corpus (ILSVRC). KS/MDS: Hidden Markov Model structure; random walk between classes; for each visited class, sample one image. TXT: based on class structure in natural language; for each class occurrence in a text document, sample one image

84 Realistic Image Sequences: MDS Apply Multi-Dimensional Scaling (MDS) to ImageNet hierarchy Random walk on k-nn graph For each visited class, sample one image from ILSVRC corpus Properties: highly connected semantic clusters, random walk stays within one "topic" for an extended time

85 Realistic Image Sequences: KS Apply Kernelized Sorting (KS) to ImageNet hierarchy Random walk on resulting grid graph For each visited class, sample one image from ILSVRC corpus Properties: similar classes are close, but no cluster structure, random walk frequently "changes topic"

86 Introducing Context Switches Extend MDS/KS to allow "jumps" called MDS(λ), KS(λ) Introduce parameter λ > 0 Instead of taking a random walk step, jump to arbitrary (random) node in the graph with probability λ. Result: homogeneous subsequences of variable length
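
As an illustration of the MDS(λ)/KS(λ) construction, here is a sketch of a random walk with jump probability λ over a class graph (not the paper's generator; the graph representation and names are assumptions):

```python
import random

def sample_class_sequence(neighbors, length, lam=0.01, rng=random):
    """Random walk over a class graph with jump probability lam.

    neighbors: dict mapping each class to the list of its graph neighbours
               (k-NN graph for MDS, grid graph for KS); every node is assumed
               to have at least one neighbour.
    With probability lam the walk restarts at an arbitrary node, producing
    homogeneous subsequences of variable length; for each visited class one
    image would then be sampled from the ILSVRC corpus.
    """
    nodes = list(neighbors)
    current = rng.choice(nodes)
    sequence = []
    for _ in range(length):
        sequence.append(current)
        if rng.random() < lam:
            current = rng.choice(nodes)               # context switch ("jump")
        else:
            current = rng.choice(neighbors[current])  # ordinary random-walk step
    return sequence
```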

87 Realistic Image Sequences: TXT. Given: corpus of well-formed English texts (Project Gutenberg). Generate image sequence: discard non-nouns from the text; scan the noun sequence for class names or ImageNet hypernyms; if leaf in hierarchy (cucumber): sample an image from that class; if interior node (dog): sample a random leaf from the subtree (Tibetan mastiff), sample an image from that leaf class. "... when the rabbit actually took a watch out of its waistcoat-pocket and looked at it and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, ..." (Excerpt from Alice in Wonderland; italics: nouns, bold: ILSVRC2010 (super-)classes)

88 Example Sequences. TXT: rabbit, watch, rabbit, watch, rabbit, rabbit, jar, orange, jar, ... MDS: asparagus, jalapeno, green onion, jalapeno, jalapeno, kidney bean, pumpkin, french fries, ... KS: nematode, sea cucumber, snow leopard, leopard, leopard, leopard, mink, weasel, ... RND: speedboat, coral reef, burrito, lionfish, envelope, fur coat, trifle, paddle, punching bag, ... Example label sequences and test images for TXT, MDS, KS, compared to uniform i.i.d. (RND). Images: ImageNet dataset

89 Experimental Setup. Base datasets: ILSVRC2010, ILSVRC2012 (val part). Base classifiers (pre-trained): Convolutional Neural Network (libccv, AlexNet style); SVM with 4K-dim. Fisher vectors (yael/jsgd) + Platt scaling. Methods: base classifier; base classifier + adaptation (+adapt); base classifier + windowed adaptation (+dyn). Test sets: 100 sequences each of MDS, MDS(λ) for λ ∈ {0.001, 0.01, 0.1}, length 3000; KS, KS(λ) for λ ∈ {0.001, 0.01, 0.1}, length 3000; TXT, variable length (avg. 3475); RND, length 3000. Error measures: top-1 error rate, top-5 error rate

91 Results ILSVRC2012 CNN CNN+adapt CNN+dyn TXT 19.8 ± ± ± 1.7 MDS 16.1 ± ± ± 2.6 MDS(0.001) 15.6 ± ± ± 1.9 MDS(0.01) 15.7 ± ± ± 1.1 MDS(0.1) 16.2 ± ± ± 0.7 KS 16.4 ± ± ± 1.3 KS(0.001) 16.5 ± ± ± 1.2 KS(0.01) 16.4 ± ± ± 1.0 KS(0.1) 16.5 ± ± ± 0.8 RND 16.5 ± ± ± 0.6 Online Feedback (each cell: top-5 error [%], mean and std.dev. over 100 sequences)

95 Results ILSVRC2012 CNN CNN+adapt CNN+dyn TXT 19.8 ± ± ± 1.7 MDS 16.1 ± ± ± 3.3 MDS(0.001) 15.6 ± ± ± 2.6 MDS(0.01) 15.7 ± ± ± 1.4 MDS(0.1) 16.2 ± ± ± 0.8 KS 16.4 ± ± ± 1.6 KS(0.001) 16.5 ± ± ± 1.6 KS(0.01) 16.4 ± ± ± 1.3 KS(0.1) 16.5 ± ± ± 0.8 RND 16.5 ± ± ± 0.6 Bandit Feedback (each cell: top-5 error [%], mean and std.dev. over 100 sequences)

96 Results ILSVRC2012 CNN CNN+adapt CNN+dyn TXT 19.8 ± ± ± 1.7 MDS 16.1 ± ± ± 2.9 MDS(0.001) 15.6 ± ± ± 2.7 MDS(0.01) 15.7 ± ± ± 1.5 MDS(0.1) 16.2 ± ± ± 0.8 KS 16.4 ± ± ± 1.5 KS(0.001) 16.5 ± ± ± 1.5 KS(0.01) 16.4 ± ± ± 1.1 KS(0.1) 16.5 ± ± ± 0.8 RND 16.5 ± ± ± 0.6 Unsupervised (No Feedback) (each cell: top-5 error [%], mean and std.dev. over 100 sequences)

97 Summary. Observations: soon we will buy computer vision components pre-trained, which creates new kinds of research problems; in real problems the images to be classified are not uniform i.i.d.: class imbalance, dependent samples, non-stationary distribution. Contributions: classifier adaptation with on-the-fly estimation of class priors, oblivious of the underlying base classifiers, only adjusting output scores; three methods for creating dependent test image sequences. Results: on-the-fly adaptation can reduce the error rate substantially; for good enough base classifiers, no additional supervision is needed.

98 Thanks to... The team at IST Austria: Alex Kolesnikov Georg Martius Asya Pentina Amélie Royer Alex Zimin Funding Sources:
