Dreem Challenge report (team Bussanati)

Wavelet course, MVA 04-05
Simon Bussy, simon.bussy@gmail.com
Antoine Recanati, arecanat@ens-cachan.fr

1 Description and specifics of the challenge

We worked on the challenge proposed by the start-up Dreem. The data consists of short epochs of EEG signal, sampled at 00 Hz during a normal night on different subjects. Each epoch is labeled 0 or 1: a label 0 means that no slow oscillation was measured during the 0.5 seconds following the epoch, while a label 1 means that a slow oscillation occurred during the 0.5 seconds following the epoch. Figure 1 shows a few of these time series.

Figure 1: example of time series

An efficient detection of slow oscillations would enable the start-up's product to help the user switch sleep stages by stimulating the appearance of slow oscillations. One specificity of the dataset is that it is skewed: only 7% of the examples are positive. Therefore, the metric used to score the results is the AUC, that is, the area under the ROC curve.
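Because the classes are so unbalanced, the AUC is computed from the ranking induced by continuous scores rather than from hard 0/1 predictions; a minimal sketch of the metric, assuming scikit-learn and using made-up toy values, is:

```python
# Minimal sketch of the scoring metric, assuming scikit-learn;
# the labels and scores below are made up for illustration only.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1])   # skewed labels: one positive
scores = np.array([0.10, 0.30, 0.20, 0.05, 0.40, 0.15, 0.25, 0.70])

# The AUC only depends on how the positive examples are ranked
# relative to the negatives, not on any decision threshold.
print(roc_auc_score(y_true, scores))          # 1.0 for this toy ranking
```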

Another interesting point is that the data contains the IDs of the subjects, which allows us to fit the data separately for each subject. Figure 2 shows the proportion of the data that comes from each subject. The subjects happen to behave quite differently: there are substantial discrepancies in the positive rates across subjects, as one can see from Figure 3.

Figure 2: proportion of each subject in the training set

A first look at the data does not reveal any obvious trick to help classify it. For example, we plotted boxplots of the means and ranges of the positive and negative classes (Figure 5), and Figure 4 shows Fourier transform amplitude coefficients for some positive and negative examples. We also plotted these boxplots for each subject separately, but this does not give any useful information either. We also tried to apply some classifiers (logistic regression, SVM) directly to the raw data, but these yielded poor AUC scores. Scattering transforms build invariant, non-linear and only slightly redundant representations of the signal, which can give an accurate way to classify signals such as EEG ([1], [2]). Naive neural networks can lead to similar results, although they usually require more computation and behave more like a black box.
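For reference, such a raw-data baseline only takes a few lines; the sketch below assumes scikit-learn and uses placeholder names X (raw epochs) and y (labels), with cross-validated AUC as the score.

```python
# Minimal sketch of a raw-data baseline, assuming scikit-learn.
# X (n_epochs, n_samples) and y (0/1 labels) are placeholders for the data.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def baseline_aucs(X, y):
    results = {}
    for clf in (LogisticRegression(max_iter=1000), LinearSVC(dual=False)):
        model = make_pipeline(StandardScaler(), clf)
        # The "roc_auc" scorer ranks examples by decision_function /
        # predict_proba, so no thresholding is involved.
        results[type(clf).__name__] = cross_val_score(
            model, X, y, cv=5, scoring="roc_auc").mean()
    return results
```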

Figure 3: positive rate for each subject in the training set (each slice's thickness is proportional to the positive rate of the subject)

Figure 4: FFT coefficients for four examples of signals, with positive and negative examples in each subplot (single-sided amplitude spectrum |Y(f)| against frequency in Hz)

Figure 5: boxplots of (a) the means and (b) the ranges for the positive and negative classes

2 Approaches and results

Having not found any relevant feature in the data that would let us design an algorithm specific to this problem, we tackled the challenge as follows: we first computed scattering transforms of the signal; then we tried several algorithms to classify the data, keeping in mind that the data set is skewed and that we want to maximize the AUC score; finally, we tried to train a classifier with a neural network and to use the knowledge of the subject IDs. An additional constraint we set ourselves was to classify the data in a reasonable time on our laptops.

2.1 Computation of the scattering transforms

As one can see from the Fourier transforms (Figure 4), most of the information in the signal is contained below 20 Hz. We want to take a sliding window of at least 0.05 s, which makes the next power of two T = 2^4 samples. We tried several values of T (2^4, 2^5, 2^6), and we also chose the quality factors Q1 and Q2 of each order by cross-validation. We let the function T_to_J select the scales j1 and j2 (we also tried to use the third-order coefficients, but this did not change the score). Assuming the signal is translation-invariant, we used the built-in wavelet_factory_1d. The optimal values we found (with the classifiers used afterwards) were T = 2^5, equal quality factors Q1 = Q2, and j1 = j2 = 5. For these values, a visualization of the scattering coefficients is shown in Figure 6.

Figure 6: scattering coefficients of the mean of the negatives, with (right) and without (left) normalization and logarithm

2.1.1 Heuristics

We tried to improve the score with a few ad-hoc techniques. Following [2], we tried to threshold the scattering coefficients in order to reduce the impact of noise. We also tried to use only the second-order coefficients, in case the difference between positives and negatives were contained in those rather than in the first-order coefficients. Finally, we tested whether renormalizing the scattering coefficients and taking their logarithm improved the classification score. None of these heuristics systematically and substantially improved the score, so we kept the scattering coefficients in their original form.
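The feature extraction itself was done with the MATLAB ScatNet toolbox (T_to_J, wavelet_factory_1d). As a rough Python counterpart, assuming the Kymatio library (version 0.2 or later) is acceptable as a stand-in, a scattering feature vector for one epoch could be computed as follows; the epoch length and the (J, Q) values are illustrative, not the tuned ones above.

```python
# Rough Python counterpart of the scattering feature extraction,
# using Kymatio (>= 0.2) instead of the MATLAB ScatNet toolbox used here.
# The epoch length and the (J, Q) values are illustrative only.
import numpy as np
from kymatio.numpy import Scattering1D

n_samples = 2 ** 10                 # placeholder epoch length
scattering = Scattering1D(J=5, shape=(n_samples,), Q=2)

epoch = np.random.randn(n_samples)  # stands for one EEG epoch
Sx = scattering(epoch)              # (n_coefficients, n_samples / 2**J)

# One simple way to turn the coefficients into a fixed-size feature
# vector for the classifiers below: average them over time.
features = Sx.mean(axis=-1)
print(features.shape)
```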

2.1.2 Naive implementation

Our first try was to use a standard classifier on these feature vectors, e.g. logistic regression. This yielded an AUC of around 0.74. A linear SVM did not score significantly better, and a radial-basis-function SVM took a very long time to run. To improve the score, we then tried to tackle the fact that the data set is skewed by using algorithms specifically designed to maximize the AUC.

2.2 Classification algorithms to maximize the AUC score

2.2.1 Approximated AUC gradient descent

We implemented the gradient-descent method described by Herschtal and Raskutti in [3]. The main lines of the paper are the following.

1. We try to optimize a linear classifier.

2. The empirical AUC can be written as
$$\widehat{\mathrm{AUC}}(\beta) = \frac{1}{PQ} \sum_{j=1}^{P} \sum_{k=1}^{Q} g\big(\beta \cdot (x_j^+ - x_k^-)\big), \qquad (1)$$
where β describes the linear classifier, the x_j^+ are the feature vectors of the positive examples (there are P of them), the x_k^- are those of the negative examples (Q of them), and g is the Heaviside function.

3. Replacing the Heaviside function by the sigmoid makes the objective differentiable and does not change its value much if the norm of β is large enough, yielding the rank statistic
$$R(\beta) = \frac{1}{PQ} \sum_{j=1}^{P} \sum_{k=1}^{Q} s\big(\beta \cdot (x_j^+ - x_k^-)\big), \qquad s(x) = \frac{1}{1 + e^{-x}}. \qquad (2)$$

4. An unbiased estimator of the rank statistic, with a reasonably low variance, is given by the following expression, which brings a huge computational (and memory) gain:
$$R_1(\beta) = \frac{1}{Q} \sum_{k=1}^{Q} s\big(\beta \cdot (x_{k \bmod P}^+ - x_k^-)\big). \qquad (3)$$

5. Starting from a β with a small norm and increasing it until it is large enables us not to fall into local maxima, and to end up with a rank statistic close to the true empirical AUC.

We implemented this algorithm, initializing β with a linear regression, and the results were quite good (AUC around 0.87).
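A compact sketch of the resulting update, written from equation (3) in plain NumPy; the learning rate, iteration count and the plain gradient-ascent loop are illustrative choices rather than the exact settings used above:

```python
# Sketch of gradient ascent on the rank statistic R_1 of equation (3).
# Learning rate and iteration count are illustrative choices only.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def auc_gradient_ascent(X_pos, X_neg, n_iter=200, lr=0.1):
    """X_pos: (P, d) positive features, X_neg: (Q, d) negative features."""
    P, d = X_pos.shape
    Q = X_neg.shape[0]
    diffs = X_pos[np.arange(Q) % P] - X_neg   # pairs x^+_{k mod P} - x^-_k
    beta = np.zeros(d)                        # start with a small norm
    for _ in range(n_iter):
        s = sigmoid(diffs @ beta)
        grad = (s * (1.0 - s)) @ diffs / Q    # gradient of R_1 w.r.t. beta
        beta += lr * grad                     # ascent step: maximize R_1
    return beta

# New examples are then ranked by the score x @ beta, which is what
# the AUC is computed on.
```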

2.2.2 Probabilistic Principal Component Analysis (PPCA)

Another algorithm that seemed suited to unbalanced data sets is the mixture of probabilistic principal component analysers proposed by Tipping and Bishop in [4]. It is a generative model providing a posterior probability distribution of the data given the reduced representation, which is treated as a latent variable. In a nutshell, like PCA it is accurate for data lying on a low-dimensional manifold, but the latent-variable and probabilistic framework makes it possible to build a mixture model and, more importantly here, to train one model on the positives and one on the negatives, and then to compute the probability of belonging to each of these classes (as a sum of likelihoods). Here, the fact that the data set is skewed does not change those probabilities, except that there are more examples available to fit the negative model than the positive one. We implemented this algorithm, using mixture models and tuning the parameters (number of mixture components and dimension of the subspace) of each model by cross-validation, but using more than one mixture component eventually led to over-training: for some values we could reach a score of almost 0.99 on the train set, but less than 0.6 on the test set. The results of this method were good, but not optimal: on a test set created with 20% of the train set, we found an AUC score of 0.8 (and we have no submissions left!).
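A simplified stand-in for this per-class generative scoring (one probabilistic PCA density per class, with no mixture, since more than one component over-trained anyway) can be sketched with scikit-learn, whose PCA.score_samples returns per-sample log-likelihoods under the Tipping-Bishop probabilistic PCA model; the subspace dimension and the X, y names are placeholders:

```python
# Simplified stand-in for the per-class PPCA scoring described above:
# one probabilistic PCA density per class (no mixture). scikit-learn's
# PCA.score_samples gives log-likelihoods under the probabilistic PCA
# model of Tipping & Bishop. The subspace dimension is a placeholder.
from sklearn.decomposition import PCA

def fit_ppca_models(X, y, n_components=10):
    """Fit one PPCA density per class on the feature vectors X with labels y."""
    return {label: PCA(n_components=n_components).fit(X[y == label])
            for label in (0, 1)}

def ppca_scores(models, X):
    """Log-likelihood ratio; higher means more likely to be a positive."""
    return models[1].score_samples(X) - models[0].score_samples(X)
```

Only the ranking of these scores matters for the AUC, so the likelihood difference can be used directly as the submission score.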

2.3 Neural networks

We then tried artificial neural networks trained with the backpropagation algorithm. We tried different preprocessings (Fourier basis, scattering transform), and it turned out that the best practice was to plug the original time-series data directly into the network. We tried different architectures, including ones with many hidden layers, known as deep learning models. Many layers allow such models to compactly represent highly non-linear and highly varying functions, so the complexity of the model increases, which often leads to a higher risk of overfitting, as explained in the third section. Because neural-network learning is in general not convex, training can be quite complicated, as we noticed: there is no guarantee of finding a globally optimal set of parameters. Curse-of-dimensionality phenomena also arise when analyzing data in high-dimensional spaces, since the amount of data needed to support a result often grows exponentially with the dimensionality, so hyper-parameter selection becomes harder with deep neural networks. Hyper-parameters associated with the network structure and with the training algorithm need to be selected, among them the initial learning rate, the learning-rate schedule, the number of training iterations, the number of hidden units, the non-linearity, the preprocessing steps, etc. There is as yet no clear theoretical way to set any of these hyper-parameters, which is why we had to run many cross-validations in different parts of the parameter space, chosen using interpretation and intuition coming from previous results. The optimal architecture is four layers with 4,,0,9 neurons and 400 iterations.

2.4 Use of additional knowledge on the subject IDs

We tried to use the extra information contained in the subject IDs in two ways:

1. train a classifier for each subject;
2. train a classifier on all subjects and then add a subject-dependent term to the final classifier.

The first idea did not improve the results because it led to overfitting: the number of positives in the training set for a single subject was often too low (on the order of a hundred examples, whereas there were around 600 features) to properly train the classifier. The second idea was to use the whole training set to obtain a robust classifier, compare the predicted positive rate of this classifier for each subject to the effective one, and artificially relabel as positive the negatives that are closest to being classified as positive, so as to match the rates. This gave a very slight increase in performance.

Conclusions

To test these algorithms, we separated our training set into a sub-training set (80%) and a test set (20%). We thought the AUC score was the same whether computed on likelihoods or on hard labels (0 and 1), but it actually is not, so we realized only very late before the deadline that the gradient-descent algorithm and PPCA were good. Until then, our best scores had been obtained with the neural network. We no longer have enough time or submissions left to try to combine the results of the different algorithms, but we had fun doing this project anyway!

References

[1] Wavelets in Neurosciences, A. E. Hramov, A. A. Koronovskii, V. A. Makarov, A. N. Pavlov, E. Sitnikova, Springer.
[2] Wavelet-based EEG denoising for automatic sleep stage classification, Edson Estrada, Homer Nazeran, Gustavo Sierra, Farideh Ebrahimi, S. Kamaledin Setarehdan.
[3] Optimising Area Under the ROC Curve Using Gradient Descent, Alan Herschtal, Bhavani Raskutti.
[4] Mixtures of Probabilistic Principal Component Analysers, Michael E. Tipping and Christopher M. Bishop, Neural Computation 11(2), pp. 443-482, MIT Press.