Multi-scale Geometric Summaries for Similarity-based Upstream Sensor Fusion


Multi-scale Geometric Summaries for Similarity-based Upstream Sensor Fusion. Duke University, ECE / Math, 3/6/2019

Overall Goals / Design Choices: leverage multiple, heterogeneous modalities in identification; develop general tools without domain-specific models; techniques are unsupervised (no training data required).

OuluVS2 Digits Dataset: 51 speakers; 10 sequences, 3 instances per speaker per sequence; video from multiple points of view, plus audio. http://www.ee.oulu.fi/research/imag/ouluvs2/index.html

Why Digits? The modalities capture different aspects ("p" versus "b"). There is variation across speakers and across runs. Even after uniform time scaling, the raw audio signals do not align perfectly in time.

Problems And Success Metrics: decompose the set of digit strings in various ways: by digit string, by speaker, by speaker and digit string. The goal is a similarity ranking mechanism µ such that, for each object s, µ(s, t) is larger when t is in the same class as s (Rusinkiewicz and Funkhouser 2009).

Problems And Success Metrics: success is evaluated by precision-recall curves for each object s. Recall: the proportion of the class's items retrieved so far in the similarity-ordered list. Precision: the proportion of retrieved items that actually belong to the class.

Problems And Success Metrics: success is evaluated by precision-recall curves for each object s. We report average P-R curves; the area under the average P-R curve is the mean average precision (MAP).
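The ranking-based evaluation above can be sketched in a few lines. This is a minimal illustration (helper names and the toy data are mine, not the authors' code) of precision, recall, and average precision for a single query:

```python
import numpy as np

def precision_recall(sim_to_query, same_class):
    """Precision and recall at every cutoff of a similarity-ranked list.

    sim_to_query: similarity of each item to the query (higher = more similar)
    same_class:   boolean array, True where an item shares the query's class
    """
    order = np.argsort(-sim_to_query)        # most similar first
    hits = same_class[order].astype(float)
    cum_hits = np.cumsum(hits)
    ranks = np.arange(1, len(hits) + 1)
    precision = cum_hits / ranks             # correct items among those retrieved
    recall = cum_hits / hits.sum()           # class items retrieved so far
    return precision, recall

def average_precision(sim_to_query, same_class):
    """Area under the P-R curve: mean precision at each relevant item's rank."""
    precision, _ = precision_recall(sim_to_query, same_class)
    rel = same_class[np.argsort(-sim_to_query)]
    return precision[rel].mean()

# Toy example: 5 items, 2 of which share the query's class.
sim = np.array([0.9, 0.1, 0.8, 0.4, 0.3])
cls = np.array([True, False, True, False, False])
ap = average_precision(sim, cls)   # both class items ranked first -> AP = 1.0
```

Averaging `average_precision` over all query objects gives the MAP quoted in the talk.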

Other Approaches, Our Pipeline(s): many approaches (including ours) construct µ by mapping strings into a feature space. There are many deep learning approaches (Lopez and Sukno, 2018); another trains an HMM per class and uses canonical correlation analysis to learn good ways to extract fused audio/visual features (Sargin et al., 2007). We propose a set of entirely unsupervised pipelines: labeled examples are used only to evaluate, not to train.

Self-Similarity Matrices (SSMs): D_ij = ||X_i - X_j||_2

Why SSMs? Imran N. Junejo et al. View-independent action recognition from temporal self-similarities. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 33.1 (2011), pp. 172–185.

SSMs on Our Data. Video: extract the lip region from each frame and rescale to 25 × 25 grayscale; treat as a time series in 25 × 25 = 625-dimensional Euclidean space. Audio: break the audio signal into overlapping windows; summarize each window via 20 MFCC coefficients; treat as a time series in 20-dimensional Euclidean space.

Similarity Network Fusion (SNF): transform several weight matrices W_1, ..., W_m into one that (hopefully) has the best qualities of all. Based on random walks with cross-talk between the matrices for probabilities (works best when the modalities are complementary). Bo Wang et al. Unsupervised metric fusion by cross diffusion. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2997–3004. Bo Wang et al. Similarity network fusion for aggregating data types on a genomic scale. In: Nature Methods 11 (2014). (Slide footer: Christopher Tralie, Paul Bendich, John Harer. Multi-scale Geometric Summaries for Similarity-based Upstream Sensor Fusion.)
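A minimal sketch of SNF-style cross diffusion for intuition. This simplifies the normalization scheme of Wang et al. (plain row-stochastic kernels instead of their split diagonal treatment), and all parameter values and names are illustrative assumptions:

```python
import numpy as np

def affinity(D, sigma=1.0):
    """Gaussian affinity from a pairwise-distance matrix."""
    W = np.exp(-D**2 / (2.0 * sigma**2))
    np.fill_diagonal(W, 0.0)
    return W

def row_normalize(W):
    """Scale each row to sum to 1 (a transition-probability matrix)."""
    return W / W.sum(axis=1, keepdims=True)

def knn_kernel(W, k):
    """Sparse kernel keeping only each row's k strongest neighbors."""
    S = np.zeros_like(W)
    for i, row in enumerate(W):
        nn = np.argsort(-row)[:k]
        S[i, nn] = row[nn]
    return row_normalize(S)

def snf(distance_mats, k=5, n_iter=20, sigma=1.0):
    """Simplified similarity network fusion by cross diffusion."""
    Ws = [affinity(D, sigma) for D in distance_mats]
    Ps = [row_normalize(W) for W in Ws]   # full (dense) kernels
    Ss = [knn_kernel(W, k) for W in Ws]   # local (sparse) kernels
    m = len(Ps)
    for _ in range(n_iter):
        new_Ps = []
        for i in range(m):
            # Diffuse modality i through the average of the *other* kernels.
            others = sum(Ps[j] for j in range(m) if j != i) / (m - 1)
            new_Ps.append(Ss[i] @ others @ Ss[i].T)
        Ps = [row_normalize(P) for P in new_Ps]
    return sum(Ps) / m                    # fused similarity matrix

# Two stand-in "modalities": the same geometry seen with different noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
D1 = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
noise = rng.random((30, 30))
D2 = D1 + 0.05 * (noise + noise.T)
W_fused = snf([D1, D2])
```

The cross-talk is the `others` term: each modality's kernel is updated through its neighbors' diffusion operators, so structure shared across modalities is reinforced.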

SNF for Early Audio-Visual Fusion: we use SNF to fuse the MFCC (audio) and lip-pixel (video) SSMs. [Figure: audio SSM W_A, video SSM W_v, and fused SSM W_F for three strings; a: repeating 4s, b: repeating 5s, c: repeating 7s.]

How To Compare (Fused) SSMs? Each string s is transformed into SSMs W_A(s), W_v(s), then fused into W_F(s). How to compare W_F(s) with W_F(s')? Could just use l_2 (the matrix Frobenius norm).

Measuring Similarity between SSMs. Each string s is transformed into SSMs W_A(s), W_v(s), then fused into W_F(s). How to compare W_F(s) with W_F(s')? Could just use l_2 (the matrix Frobenius norm), but local delays (time warps) induce local perturbations in SSMs, and the l_2 norm is unstable to these perturbations.
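A toy demonstration of that instability: a small, smooth time warp of a sinusoid already moves its SSM noticeably in Frobenius norm (the signal and numbers here are illustrative, not from the talk's data):

```python
import numpy as np

def ssm_1d(x):
    """SSM of a scalar time series: D[i, j] = |x[i] - x[j]|."""
    return np.abs(x[:, None] - x[None, :])

t = np.linspace(0.0, 1.0, 200)
x = np.sin(2 * np.pi * 5 * t)

# Small smooth time warp: displaces samples by at most 2% of the duration.
warp = t + 0.02 * np.sin(2 * np.pi * t)     # strictly increasing
x_warped = np.interp(t, warp, x)            # resample the warped signal

# Relative Frobenius-norm change of the SSM under this tiny warp.
rel_change = (np.linalg.norm(ssm_1d(x) - ssm_1d(x_warped))
              / np.linalg.norm(ssm_1d(x)))
```

Even though the warp is barely visible, `rel_change` is far from zero, which is exactly why the talk replaces direct l_2 comparison with the scattering transform introduced next.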

The Scattering Transform: instead of l_2, use the scattering transform on SSMs. It has nice theoretical stability properties.

The Scattering Transform: A Few Details. Given an N × N image I(u, v), choose a lowpass filter φ(u, v). Level 0: S_0(u, v) = (I * φ)(u, v). There are d × d total level-0 coefficients, where d = N/2^(J-1) and J is the maximum scale.

The Scattering Transform: A Few Details. Now choose a mother wavelet ψ(u, v), a set of L directions γ_i, and a set of J scales j ∈ {0, 1, ..., J-1}. Level 1: S^1_{i,j}(u, v) = (|I * 2^(-2j) ψ_{γ_i}(u/2^j, v/2^j)| * φ)(u, v). Using complex Gabor wavelets: ψ_γ(u, v) = e^(iγ·(u,v)) e^(-(u² + v²)/σ²).

The Scattering Transform: A Few Details. Now choose a mother wavelet ψ(u, v), a set of L directions γ_i, and a set of J scales j ∈ {0, 1, ..., J-1}. Level 1: S^1_{i,j}(u, v) = (|I * 2^(-2j) ψ_{γ_i}(u/2^j, v/2^j)| * φ)(u, v). There are d² L J level-1 coefficients.

The Scattering Transform: A Few Details. Level 2: S^2_{i,j,k,l}(u, v) = (||I * 2^(-2j) ψ_{γ_i}(u/2^j, v/2^j)| * 2^(-2l) ψ_{γ_k}(u/2^l, v/2^l)| * φ)(u, v). (1) There are d² L² J(J-1)/2 level-2 coefficients.
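The level-0/level-1 recipe above can be sketched directly in NumPy. This toy version is my own simplification (circular FFT convolution, no subsampling, the Gabor's Gaussian envelope reused as the lowpass φ) meant only to show the convolve / modulus / smooth structure, not to stand in for a production scattering implementation:

```python
import numpy as np

def gabor(N, j, theta, sigma=0.8, xi=3.0):
    """Complex Gabor wavelet dilated to scale 2^j at orientation theta."""
    u, v = np.meshgrid(np.arange(N) - N // 2, np.arange(N) - N // 2,
                       indexing="ij")
    u, v = u / 2.0**j, v / 2.0**j
    rot = u * np.cos(theta) + v * np.sin(theta)
    return 2.0**(-2 * j) * np.exp(1j * xi * rot
                                  - (u**2 + v**2) / (2 * sigma**2))

def conv2(a, b):
    """Circular 2-D convolution via the FFT (fine for a sketch)."""
    return np.fft.ifft2(np.fft.fft2(a) * np.fft.fft2(np.fft.ifftshift(b)))

def scattering_level1(img, J=4, L=8):
    """Level-0 and level-1 scattering coefficients of a square image."""
    N = img.shape[0]
    # |Gabor| is just its Gaussian envelope, so it serves as the lowpass φ.
    phi = np.abs(gabor(N, J - 1, 0.0))
    S0 = np.real(conv2(img, phi))                       # level 0: I * φ
    S1 = []
    for j in range(J):                                  # scales
        for i in range(L):                              # orientations
            theta = np.pi * i / L
            U = np.abs(conv2(img, gabor(N, j, theta)))  # wavelet modulus
            S1.append(np.real(conv2(U, phi)))           # smooth with φ
    return S0, S1                                       # len(S1) == L * J paths

img = np.random.default_rng(2).random((64, 64))
S0, S1 = scattering_level1(img)   # 1 level-0 map plus L*J = 32 level-1 maps
```

Level 2 iterates the same modulus-then-smooth step on each `U`, over the J(J-1)/2 ordered scale pairs.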

The Scattering Transform: A Few Details. One can continue past level 2, but we stop there. Repeatedly convolving with a wavelet, taking the complex modulus, and lowpass filtering gives a CNN-style architecture, but unsupervised. Each choice of wavelets in sequence is called a path.

Scattering Transform As Feature Extractor. Resize each SSM to 256 × 256 resolution. Take L = 8 equally spaced directions between 0 and π. Take J = 4 scales, so that each path is 32 × 32. This results in 32² (1 + 8·4 + 8²·(4·3/2)) = 427,008 scattering coefficients extracted from each SSM (6.5× the data size, but stable).
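The coefficient count can be checked directly; a small sanity computation, assuming (as stated above) each output map is kept at d = N/2^(J-1) = 32 resolution:

```python
# Levels 0, 1, 2 for an N x N image with L directions and J scales.
N, L, J = 256, 8, 4
d = N // 2**(J - 1)                              # 32 x 32 per path
paths = 1 + L * J + L**2 * (J * (J - 1) // 2)    # 1 + 32 + 384 = 417 paths
total = d**2 * paths                             # total scattering coefficients
print(total)   # 427008
```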

Scattering Transform As Feature Extractor. [Figure: example scattering transform of an SSM.]

SNF for Late Audio-Visual Fusion. Everything so far has happened upstream, before ranking decisions are made, but we can also apply SNF downstream. Given object-level metrics µ_1, ..., µ_k on a set of N objects (strings), each metric produces an object-level SSM, and these can themselves be fused into a new SSM. We apply that here with k = 3 (audio, visual, early fused).

Results: Digit String Identification

Results: Digit String Identification, Simulated Noise. [Figure: results at PSNR levels 26, 20, 16.5, 14, 12, and 10.5 dB.]

Results: Speaker Identification, Simulated Noise. [Figure: results at PSNR levels 26, 20, 16.5, 14, 12, and 10.5 dB.]

Results: Joint Speaker And String Identification, Simulated Noise. [Figure: results at PSNR levels 26, 20, 16.5, 14, 12, and 10.5 dB.]