Joint Emotion Analysis via Multi-task Gaussian Processes


Daniel Beck, Trevor Cohn, Lucia Specia
October 28, 2014

Outline
1 Introduction
2 Multi-task Gaussian Process Regression
3 Experiments and Discussion
4 Conclusions and Future Work

Emotion Analysis
Goal: automatically detect emotions in a text [Strapparava and Mihalcea, 2008].

Headline                                        Fear  Joy  Sadness
Storms kill, knock out power, cancel flights      82    0       60
Panda cub makes her debut                          0   59        0

Why Multi-task?
Learn a model that shows sound and interpretable correlations between emotions.
Datasets are scarce and small: multi-task models are able to learn from all emotions jointly.
The annotation scheme is subjective and fine-grained, and therefore prone to bias and noise.
Disclaimer: this work is not about features (at the moment...)

Multi-task Learning and Anti-correlations
Most multi-task models used in NLP assume some degree of correlation between tasks:
Domain Adaptation: assumes the existence of general, domain-independent knowledge in the data.
Annotation Noise Modelling: assumes that annotations are noisy deviations from a ground truth.
For Emotion Analysis, we need a multi-task model that can take possible anti-correlations into account, avoiding negative transfer (see the headline examples above).

2 Multi-task Gaussian Process Regression

Gaussian Processes
Let (X, y) be the training data and f(x) the latent function that models the data:

f(x) ~ GP(µ(x), k(x, x'))

where µ(x) is the mean function and k(x, x') is the kernel function.

The posterior over the latent function combines the likelihood and the prior, normalised by the marginal likelihood:

p(f | X, y) = p(y | X, f) p(f) / p(y | X)

For a test input x*, the predictive distribution integrates the test likelihood over the posterior:

p(y* | x*, X, y) = ∫ p(y* | x*, f, X, y) p(f | X, y) df
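To make the definition concrete, here is a minimal numpy sketch (not part of the original slides): it draws samples from a GP prior over a finite set of inputs, assuming a zero mean function and a simple one-dimensional RBF kernel. All names and parameter values are illustrative.

```python
import numpy as np

def rbf_kernel(x1, x2, alpha=1.0, lengthscale=1.0):
    # Simple 1-D RBF kernel; the per-feature (ARD) version used later is analogous.
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return alpha ** 2 * np.exp(-0.5 * sq_dist / lengthscale ** 2)

# Any finite set of inputs induces a multivariate Gaussian over the latent
# function values f(x), with mean mu(x) and covariance k(x, x').
x = np.linspace(0.0, 10.0, 50)
mean = np.zeros_like(x)                        # mu(x) = 0
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))   # jitter for numerical stability
prior_samples = np.random.multivariate_normal(mean, K, size=3)
```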

GP Regression
Likelihood: in a regression setting we usually assume a Gaussian likelihood, which allows us to obtain a closed-form solution for the test posterior.
Kernel: many options are available. In this work we use the Radial Basis Function (RBF) kernel (1):

k(x, x') = α_f^2 exp( -1/2 Σ_{i=1..F} (x_i - x'_i)^2 / l_i )

(1) Also known as the Squared Exponential, Gaussian or Exponentiated Quadratic kernel.
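The closed-form solution mentioned above follows directly from these formulas. The snippet below is a sketch under the assumption of a Gaussian likelihood with noise variance noise_var (not the authors' implementation); it implements the RBF kernel with per-feature lengthscales and the resulting predictive mean and variance.

```python
import numpy as np

def rbf_ard(X1, X2, alpha_f=1.0, lengthscales=1.0):
    # k(x, x') = alpha_f^2 * exp(-0.5 * sum_i (x_i - x'_i)^2 / l_i)
    sq = (X1[:, None, :] - X2[None, :, :]) ** 2 / lengthscales
    return alpha_f ** 2 * np.exp(-0.5 * sq.sum(axis=-1))

def gp_predict(X, y, X_star, noise_var=0.1, **kern_args):
    # Predictive mean and variance under a Gaussian likelihood (closed form).
    K = rbf_ard(X, X, **kern_args) + noise_var * np.eye(len(X))
    K_s = rbf_ard(X, X_star, **kern_args)         # train-test covariances
    K_ss = rbf_ard(X_star, X_star, **kern_args)   # test-test covariances
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss - v.T @ v) + noise_var     # predictive variance of y*
    return mean, var
```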

The Intrinsic Coregionalisation Model
Coregionalisation models extend GPs to vector-valued outputs [Álvarez et al., 2012]. Here we use the Intrinsic Coregionalisation Model (ICM):

k((x, d), (x', d')) = k_data(x, x') · B_{d,d'}

where k_data is a kernel on data points (an RBF, for instance) and B is the coregionalisation matrix, which encodes task covariances. B can be parameterised and learned by optimising the model's marginal likelihood.
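A minimal sketch of the ICM kernel follows (illustrative, not the authors' code). It assumes each data point carries an integer task (emotion) index and takes the data kernel as a callable, for example the rbf_ard function from the earlier snippet.

```python
import numpy as np

def icm_kernel(X1, d1, X2, d2, B, k_data):
    # k((x, d), (x', d')) = k_data(x, x') * B[d, d']
    # X1, X2: feature matrices; d1, d2: integer task (emotion) indices per row.
    return k_data(X1, X2) * B[np.ix_(d1, d2)]

# Usage sketch: with 6 emotions, every (headline, emotion) pair becomes one
# training point, and d1/d2 hold the emotion index (0..5) of each point.
```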

PPCA model
[Bonilla et al., 2008] decompose B using PPCA:

B = U Λ U^T + diag(α)

To ensure numerical stability, we employ the incomplete Cholesky decomposition over U Λ U^T:

B = L L^T + diag(α)

PPCA model: rank of L and number of hyperparameters
With six tasks (one per emotion), L is a 6 × r matrix and α has 6 entries, so B = L L^T + diag(α) has 6r + 6 free hyperparameters:
rank 1: 12 hyperparameters
rank 2: 18 hyperparameters
rank 3: 24 hyperparameters
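The rank/hyperparameter trade-off above can be checked with a short sketch (illustrative, assuming 6 tasks): a rank-r L contributes 6·r hyperparameters, plus the 6 entries of α.

```python
import numpy as np

def coregionalisation_matrix(L, alpha):
    # B = L L^T + diag(alpha); non-negative alpha keeps B positive semi-definite.
    return L @ L.T + np.diag(alpha)

n_tasks = 6
for rank in (1, 2, 3):
    L = np.random.randn(n_tasks, rank)           # n_tasks * rank entries
    alpha = np.abs(np.random.randn(n_tasks))     # n_tasks diagonal entries
    B = coregionalisation_matrix(L, alpha)
    print(rank, n_tasks * rank + n_tasks)        # -> 12, 18, 24 hyperparameters
```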

3 Experiments and Discussion

Experimental Setup
Dataset: SemEval-2007 Affective Text [Strapparava and Mihalcea, 2007];
1000 news headlines, each annotated with six scores in the [0-100] range, one per emotion;
Bag-of-words representation as features;
Pearson's correlation coefficient as evaluation metric.
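For reference, the evaluation metric can be computed with scipy; the arrays below are made-up placeholders standing in for the gold and predicted scores of one emotion.

```python
import numpy as np
from scipy.stats import pearsonr

gold = np.array([82.0, 0.0, 60.0, 15.0])   # gold emotion scores (illustrative)
pred = np.array([70.0, 5.0, 55.0, 20.0])   # model predictions (illustrative)
r, p_value = pearsonr(gold, pred)          # Pearson's correlation coefficient
```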

Learned Task Covariances (100 sentences)

Prediction Results (train/test split: 100/900)

Training Set Size Influence (split: 100+/100)

4 Conclusions and Future Work

Conclusions and Future Work
Conclusions:
The proposed model is able to learn sensible correlations and anti-correlations;
For small datasets, it also outperforms single-task baselines.
Future Work:
Modelling the label distribution (different priors, different likelihoods);
Multiple multi-task levels (for example, MTurk data [Snow et al., 2008]);
Other multi-task GP models [Álvarez et al., 2012, Hensman et al., 2013].


Error Analysis

References
Álvarez, M. A., Rosasco, L., and Lawrence, N. D. (2012). Kernels for Vector-Valued Functions: A Review. Foundations and Trends in Machine Learning, pages 1-37.
Bonilla, E. V., Chai, K. M. A., and Williams, C. K. I. (2008). Multi-task Gaussian Process Prediction. In Advances in Neural Information Processing Systems.
Cohn, T. and Specia, L. (2013). Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation. In Proceedings of ACL.
Hensman, J., Lawrence, N. D., and Rattray, M. (2013). Hierarchical Bayesian modelling of gene expression time series across irregularly sampled replicates and clusters. BMC Bioinformatics, 14:252.
Snow, R., O'Connor, B., Jurafsky, D., and Ng, A. Y. (2008). Cheap and Fast - But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. In Proceedings of EMNLP.
Strapparava, C. and Mihalcea, R. (2007). SemEval-2007 Task 14: Affective Text. In Proceedings of SemEval.
Strapparava, C. and Mihalcea, R. (2008). Learning to Identify Emotions in Text. In Proceedings of the 2008 ACM Symposium on Applied Computing.