Neural Networks and Machine Learning research at the Laboratory of Computer and Information Science, Helsinki University of Technology

Erkki Oja, Department of Computer Science, Aalto University, Finland. AIHelsinki Seminar, April 28, 2017.

CIS lab: the pre-history

The CIS Lab was first established in 1965 as the electronics laboratory at the Dept. of Technical Physics, HUT. It changed names many times. Chair 1965-1993: Teuvo Kohonen; 1993-2009: Erkki Oja. Then restructured; now part of the big CS Department at Aalto.

Neural networks

Teuvo Kohonen started research into neural networks in the early 1970s (associative memories, subspace classifiers, speech recognition), followed by the Self-Organizing Map (SOM) from the early 1980s. Teuvo will give a presentation in AIHelsinki later this spring. E. Oja completed his PhD in 1977 with Teuvo.

Part of my PhD thesis: the Kohonen-Oja paper

One of my postdoc papers: Cooper et al.

Subspace book

Neural networks and AI: the 1970s and 1980s were a time of deep contradictions between neural computation (connectionism) and "true" AI. True AI was symbolic; the mainstream was expert (knowledge-based) systems, search, logic, frames, semantic nets, Lisp, Prolog, etc. In the late 1980s, probabilistic (Bayesian) reasoning and neural networks slowly sneaked into AI.

LeNet by Yann LeCun, 1989

The first ICANN ever, in 1991

My problem at that time: what is nonlinear Principal Component Analysis (PCA)? My solution: a novel neural network, the deep auto-encoder. E. Oja: Data compression, feature extraction, and auto-association in feedforward neural networks. Proc. ICANN 1991, pp. 737-745.

Deep auto-encoder (from the paper)

The trick is that a data vector x is both the input and the desired output. This was one of the first papers on multilayer (deep) auto-encoders, which today are quite popular. In those days, this was quite difficult to train. Newer results: Hinton and Zemel (1994), Bengio (2009), and many others.
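As a rough illustration of the same idea with modern tools, here is a minimal sketch of a deep auto-encoder trained to reproduce its own input, written in PyTorch. The layer sizes, activations, and random training data are illustrative assumptions, not the 1991 setup.

```python
# Minimal deep auto-encoder sketch: the data vector x is both the
# input and the training target. Sizes and data are assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder = nn.Sequential(nn.Linear(64, 32), nn.Tanh(), nn.Linear(32, 8), nn.Tanh())
decoder = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 64))
autoencoder = nn.Sequential(encoder, decoder)

X = torch.randn(500, 64)                      # stand-in for real data vectors
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(200):
    optimizer.zero_grad()
    reconstruction = autoencoder(X)
    loss = loss_fn(reconstruction, X)         # the target is the input x itself
    loss.backward()
    optimizer.step()

codes = encoder(X)                            # nonlinear low-dimensional representation
```

The narrow middle layer plays the role of a nonlinear PCA: the encoder output is the compressed feature representation of x.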

CIS lab: the past 25 years

The research was structured by four consecutive Centres of Excellence financed by the Academy of Finland: the Neural Networks Research Centre (NNRC), 1994-1999, with a continuation in 2000-2005; the Adaptive Informatics Research Centre (AIRC), 2006-2011; and the Computational Inference Research Centre (COIN), 2012-2017.


COIN, Centre of Excellence in Computational Inference: Introduction. Erkki Oja, Director of COIN.

COIN groups and their added value: computational logic and intelligent systems (Niemelä, Myllymäki); computational statistics and computational biology (Corander, Aurell); data analysis and machine learning (Oja, Kaski, Laaksonen).

Some ML algorithms studied at the CoEs: visualization and nonlinear dimensionality reduction; probabilistic latent variable models and Bayes blocks; relevance by data fusion; nonlinear dynamics and subspaces; relational models; sparse PCA, DSS and nonnegative projections; nonlinear and non-negative BSS; linear mixtures; SOM; ICA and FastICA; reliability analysis.

Demonstration of Independent Component Analysis (ICA): Original 9 independent images

9 mixtures with random mixing; this is the only available data we have

Estimated original images, found by an ICA algorithm
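The demo images themselves are not reproduced here, so the following is a small sketch of the same kind of experiment with synthetic sources and scikit-learn's FastICA. The signals, mixing matrix, and dimensions are assumptions made only for illustration.

```python
# ICA demonstration sketch: mix independent signals with a random
# matrix, then recover them with FastICA. The synthetic sources
# stand in for the 9 demo images of the talk.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(3 * t),                      # three independent synthetic sources
          np.sign(np.sin(5 * t)),
          rng.laplace(size=t.size)]
A = rng.normal(size=(3, 3))                   # random mixing matrix
X = S @ A.T                                   # observed mixtures: the only "available" data

ica = FastICA(n_components=3, random_state=0)
S_est = ica.fit_transform(X)                  # recovered sources, up to order, scale and sign
```

As in the image demo, only the mixtures are given to the algorithm; the sources and the mixing matrix are estimated blindly.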

My own present research topic: matrix factorizations for data analysis


Example: spatio-temporal data. Graphically, the data matrix X has one dimension for space and one for time, and it is factorized as X ≈ W H, where the columns of W are spatial patterns and the rows of H are temporal patterns.

Global daily temperature (10,512 spatial points x 20,440 days)

E.g. the global warming component: one row of matrix H and the corresponding column of matrix W.
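As a sketch of such a factorization (not the actual temperature analysis), the snippet below builds a rank-r approximation X ≈ W H of a synthetic space-by-time matrix using truncated SVD; the matrix sizes, the rank r, and the choice of SVD are assumptions for illustration only.

```python
# Low-rank factorization sketch for spatio-temporal data:
# rows of X index spatial locations, columns index time.
import numpy as np

rng = np.random.default_rng(1)
n_space, n_time, r = 200, 1000, 5             # toy sizes; the real matrix is 10,512 x 20,440
X = rng.normal(size=(n_space, n_time))        # stand-in for the temperature matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)
W = U[:, :r] * s[:r]                          # n_space x r: spatial patterns (columns of W)
H = Vt[:r, :]                                 # r x n_time: temporal components (rows of H)

X_approx = W @ H                              # rank-r approximation X ≈ W H
component = H[0]                              # e.g. one temporal component (a row of H)
```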

A successful example: the Netflix competition

Non-negative matrix factorization

NMF and its extensions are today quite an active research topic: tensor factorizations (Cichocki et al., 2009), low-rank approximation (LRA) (Markovsky, 2012), missing data (Koren et al., 2009), robust and sparse PCA (Candès et al., 2011), and symmetric NMF and clustering (Ding et al., 2012).
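For concreteness, here is a minimal sketch of plain NMF with the classical Lee-Seung multiplicative updates for the Frobenius cost. The random data, rank, and iteration count are illustrative assumptions, and none of the cited extensions are implemented here.

```python
# Minimal NMF sketch: X ≈ W H with W, H >= 0, using Lee-Seung
# multiplicative updates for the Frobenius cost.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((60, 40))                      # nonnegative data matrix (assumed)
r = 5                                         # chosen rank

W = rng.random((60, r))
H = rng.random((r, 40))
eps = 1e-9                                    # avoids division by zero

for _ in range(300):
    H *= (W.T @ X) / (W.T @ W @ H + eps)      # multiplicative update for H
    W *= (X @ H.T) / (W @ H @ H.T + eps)      # multiplicative update for W

error = np.linalg.norm(X - W @ H)             # non-increasing under these updates
```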

NMF and clustering: clustering is a very classical problem, in which n vectors (data items) must be partitioned into r clusters. The clustering result can be shown by the n x r cluster indicator matrix H. It is a binary matrix whose element h_ij = 1 if and only if the i-th data vector belongs to the j-th cluster.

The k-means algorithm minimizes the cost function J = sum_{j=1}^{r} sum_{x_i in C_j} ||x_i - c_j||^2. If the indicator matrix is suitably normalized, then this becomes equal to J = ||X - X H H^T||^2 (Ding et al., 2012). Notice the similarity to NMF and PCA! ("Binary PCA")

Actually, minimizing this (for H) is mathematically equivalent to maximizing tr(X^T X H H^T), which immediately allows the kernel trick of replacing X^T X with a kernel k(x_i, x_j), extending k-means to any data structures (Yang and Oja, IEEE Trans. Neural Networks, 2010).
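The small numerical check below illustrates these identities on toy data: with data points as the columns of X, a binary indicator H, and its column-normalized version Htil = H (H^T H)^(-1/2), the classical k-means cost equals ||X - X Htil Htil^T||^2 and differs from -tr(X^T X Htil Htil^T) only by a constant. The data and the cluster assignment are arbitrary assumptions.

```python
# Numerical check of the k-means / matrix-factorization identities above.
import numpy as np

rng = np.random.default_rng(0)
d, n, r = 3, 12, 3
X = rng.normal(size=(d, n))                   # data points are the columns of X
labels = np.arange(n) % r                     # an arbitrary assignment into r clusters

H = np.zeros((n, r))
H[np.arange(n), labels] = 1.0                 # binary n x r cluster indicator
Htil = H / np.sqrt(H.sum(axis=0))             # columns scaled by 1/sqrt(cluster size)

# classical k-means cost: squared distances to the cluster means
J_kmeans = sum(
    np.sum((X[:, labels == j] - X[:, labels == j].mean(axis=1, keepdims=True)) ** 2)
    for j in range(r)
)

J_matrix = np.linalg.norm(X - X @ Htil @ Htil.T) ** 2
trace_term = np.trace(X.T @ X @ Htil @ Htil.T)

print(np.isclose(J_kmeans, J_matrix))                          # True
print(np.isclose(J_matrix, np.trace(X.T @ X) - trace_term))    # True
```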

A novel clustering method: DCD. Starting again from the binary cluster indicator matrix H, we can define another binary matrix, the cluster incidence matrix, defined as M = H H^T. Its ij-th element is equal to one if the i-th and the j-th data items are in the same cluster, zero otherwise.

It is customary to normalize it so that the row sums (and column sums, because it is symmetric) are equal to 1 (Shi and Malik, 2000). Call the normalized matrix also M. Assume a suitable similarity measure S_ij between every i-th and j-th data item (for example a kernel). Then a nice criterion is the fit between S and M, for example J = ||S - M||^2.

This is an example of symmetrical NMF because both the similarity matrix and the incidence matrix are symmetrical, and both are naturally nonnegative. S is full rank, but the rank of M is r. Contrary to the usual NMF, there are two extra constraints: the row sums of M are equal to 1, and M is a (scaled) binary matrix. The solution: probabilistic relaxation to smooth the constraints (Yang, Corander and Oja, JMLR, 2016)
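As an illustration of these objects (not of the DCD algorithm itself), the sketch below builds the incidence matrix M = H H^T, normalizes it so the row sums are 1, and evaluates J = ||S - M||^2 against a similarity matrix S. The toy data and the Gaussian-kernel choice for S are assumptions.

```python
# Cluster incidence matrix and the symmetric-NMF-type criterion.
import numpy as np

rng = np.random.default_rng(0)
n, r = 12, 3
X = rng.normal(size=(n, 2))                   # toy data items (rows)
labels = np.arange(n) % r                     # an arbitrary clustering

H = np.zeros((n, r))
H[np.arange(n), labels] = 1.0                 # binary cluster indicator
M = H @ H.T                                   # M_ij = 1 iff items i and j share a cluster
M = M / M.sum(axis=1, keepdims=True)          # scale so row (and column) sums equal 1

# an assumed similarity measure: Gaussian kernel between data items
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
S = np.exp(-sq_dists)

J = np.linalg.norm(S - M) ** 2                # fit between similarity and incidence matrices
```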

Data-cluster-data (DCD) random walk

Clustering results for large datasets (figures comparing DCD, k-means, and Ncut).

CIS lab: the future. Now part of the CS department at the Aalto School of Science. Less isolated, with many more partnerships (other CS groups, HIIT, Helsinki University, etc.). Talented researchers, increasingly international. Strong impact on machine learning in Finland and in the world, in both research and teaching. Example: our M.Sc. program Macadamia (Machine Learning and Data Mining; Mannila & Oja, 2008).

Macadamia was the 3rd most popular M.Sc. program in the Aalto School of Science in 2017.

THANK YOU FOR YOUR ATTENTION!