IMISOUND: An Unsupervised System for Sound Query by Vocal Imitation


1 IMISOUND: An Unsupervised System for Sound Query by Vocal Imitation
Yichi Zhang and Zhiyao Duan
Audio Information Research (AIR) Lab, Department of Electrical and Computer Engineering, University of Rochester

2 Query by vocal imitation

3 Query by vocal imitation
For general sounds:
Dog barking sound (with semantic meaning): infantile bark or threat bark? Vocal imitation can narrow down the concept.
Synthesized sound (without semantic meaning): vocal imitation might be the only way to convey the concept.

4 Towards Sound Retrieval
[Figure: distances d_1, d_2, d_3, ..., d_N between the query and the N sound candidates.]

5 Challenges
People tend to imitate different aspects of different recordings (examples: car horn, cat, guitar note).
Even for the same recording, different people may imitate differently (examples: car horn 1, car horn 2, car horn 3).
Hand-crafted features such as pitch, timbre, and loudness would not work well.
Solution: Deep Neural Networks (DNN)

6 Automatic feature learning
[1] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, "Unsupervised learning of hierarchical representations with convolutional deep belief networks,"

7 Pre-processing
Constant Q Transform (CQT) spectrogram
Parameters:
Patch length: 525 ms (20 frames)
Frequency range: 50 Hz to 3200 Hz (6 octaves)
12 bins per octave
Rationale:
One syllable in normal English speech lasts about 200 ms.
50 Hz to 3200 Hz basically covers the telephone frequency range.
Friday, July 14, 2017, University of Rochester
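The parameters above imply 6 x 12 = 72 CQT bins per frame. The following is a minimal NumPy sketch of that arithmetic and of cutting a spectrogram into 20-frame patches. It assumes non-overlapping patches that are flattened into one vector each, which the slide does not state; the function names and the random stand-in spectrogram are hypothetical.

```python
import numpy as np

F_MIN = 50.0            # lowest analysed frequency from the slide (Hz)
N_OCTAVES = 6           # 50 Hz up to 3200 Hz
BINS_PER_OCTAVE = 12
PATCH_FRAMES = 20       # one patch = 20 frames (525 ms on the slide)

def cqt_center_frequencies(f_min=F_MIN, n_octaves=N_OCTAVES,
                           bins_per_octave=BINS_PER_OCTAVE):
    """Geometrically spaced center frequencies of the CQT bins."""
    k = np.arange(n_octaves * bins_per_octave)
    return f_min * 2.0 ** (k / bins_per_octave)

def split_into_patches(spectrogram, patch_frames=PATCH_FRAMES):
    """Cut a (bins, frames) spectrogram into non-overlapping patches.

    Leftover frames at the end are dropped (an assumption), and each
    patch is flattened into a vector of bins * patch_frames values.
    """
    n_bins, n_frames = spectrogram.shape
    n_patches = n_frames // patch_frames
    trimmed = spectrogram[:, :n_patches * patch_frames]
    patches = trimmed.reshape(n_bins, n_patches, patch_frames)
    return patches.transpose(1, 0, 2).reshape(n_patches, n_bins * patch_frames)

freqs = cqt_center_frequencies()
# Random values standing in for a real CQT magnitude spectrogram.
demo = np.random.default_rng(0).random((len(freqs), 65))
patches = split_into_patches(demo)
print(len(freqs), patches.shape)
```

With these settings each patch would be a 72 x 20 = 1440-dimensional input vector.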

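8 Feature Extraction
A stacked auto-encoder (SAE) is chosen as the neural network model.
[Figure: SAE architecture. Inputs x_1, ..., x_N feed the first hidden layer z^(1)_1, ..., z^(1)_M (weights w_1, bias b_1; decoder weights w_1', bias b_1'). The first-layer activations feed the second hidden layer z^(2)_1, ..., z^(2)_K (weights w_2, bias b_2; decoder weights w_2', bias b_2').]
Neurons in the 1st hidden layer = 500
Neurons in the 2nd hidden layer = 100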

9 Feature Extraction
An auto-encoder learns weights and biases so that its output approximates its input.
[Figure: a single auto-encoder with inputs x_1, ..., x_N, hidden units z_1, ..., z_M, outputs y_1, ..., y_N, and biases b and b'.]
The weights are trained on half of all the vocal imitations.
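To make the reconstruction objective concrete, here is a minimal NumPy sketch of one auto-encoder layer trained by gradient descent, then stacked. The sigmoid units, squared-error loss, learning rate, greedy layer-wise training, and toy layer sizes are all assumptions for illustration; the slides only give the layer sizes (500 and 100) and say that the output should approximate the input.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_autoencoder(X, n_hidden, epochs=300, lr=0.5, seed=0):
    """Fit one auto-encoder layer by batch gradient descent on squared error.

    X has shape (n_samples, n_inputs) with values in [0, 1].
    Returns the encoder parameters (W, b) and the loss at each epoch.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_inputs = X.shape
    W = rng.normal(0.0, 0.1, (n_inputs, n_hidden))      # encoder weights
    b = np.zeros(n_hidden)                              # encoder bias
    W_dec = rng.normal(0.0, 0.1, (n_hidden, n_inputs))  # decoder weights
    b_dec = np.zeros(n_inputs)                          # decoder bias
    losses = []
    for _ in range(epochs):
        Z = sigmoid(X @ W + b)            # hidden code
        Y = sigmoid(Z @ W_dec + b_dec)    # reconstruction of the input
        err = Y - X
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the squared reconstruction error.
        dY = err * Y * (1.0 - Y) / n_samples
        dZ = (dY @ W_dec.T) * Z * (1.0 - Z)
        W_dec -= lr * (Z.T @ dY)
        b_dec -= lr * dY.sum(axis=0)
        W -= lr * (X.T @ dZ)
        b -= lr * dZ.sum(axis=0)
    return (W, b), losses

# Toy data standing in for flattened CQT patches (hypothetical sizes).
rng = np.random.default_rng(1)
X = sigmoid(rng.normal(size=(64, 3)) @ rng.normal(size=(3, 20)))

# Greedy stacking: the second layer is trained on the first layer's codes.
(W1, b1), loss1 = train_autoencoder(X, n_hidden=8)
Z1 = sigmoid(X @ W1 + b1)
(W2, b2), loss2 = train_autoencoder(Z1, n_hidden=4)
features = sigmoid(Z1 @ W2 + b2)   # second-layer output used as the feature
```

Only the encoder halves are kept after training; the decoders exist to define the reconstruction loss.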

10 Distance Calculation
[Figure: the features of the imitation query are compared with the features of sound candidates 1, 2, 3, ..., N, giving distances d_1, d_2, d_3, ..., d_N. All features are outputs of the SAE.]

11 K-L Divergence
$$D_{KL\_sym}(P \,\|\, Q) = \frac{1}{2}\left(D_{KL}(P \,\|\, Q) + D_{KL}(Q \,\|\, P)\right) = \frac{1}{2}\left(\sum_i P(i)\ln\frac{P(i)}{Q(i)} + \sum_i Q(i)\ln\frac{Q(i)}{P(i)}\right)$$
P: vocal imitation; Q: sound candidate
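A minimal sketch of the symmetric K-L divergence in NumPy follows. How the SAE features are turned into the distributions P and Q is not described on the slide, so the inputs here are simply assumed to be non-negative vectors; the small `eps` that avoids log(0) and the renormalisation are also assumptions.

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric K-L divergence: 0.5 * (KL(P||Q) + KL(Q||P))."""
    p = np.asarray(p, dtype=float) + eps   # eps keeps the logarithm finite
    q = np.asarray(q, dtype=float) + eps
    p = p / p.sum()                        # renormalise to valid distributions
    q = q / q.sum()
    kl_pq = np.sum(p * np.log(p / q))
    kl_qp = np.sum(q * np.log(q / p))
    return 0.5 * (kl_pq + kl_qp)

imitation = [0.5, 0.3, 0.2]    # hypothetical distribution for the query
candidate = [0.2, 0.3, 0.5]    # hypothetical distribution for one sound
d = symmetric_kl(imitation, candidate)
```

Unlike the plain K-L divergence, the symmetric form gives the same value whichever of the two inputs is treated as the query.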

12 DTW Distance

13 DTW Distance
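The two DTW slides carry no text in this transcript. As a reminder of the technique, here is a textbook dynamic time warping sketch in NumPy. The Euclidean local cost and the unconstrained step pattern are assumptions and may differ from what the authors used.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences.

    a has shape (n, d) and b has shape (m, d); the local cost is the
    Euclidean distance between frames.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)   # accumulated cost table
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # advance in a only
                acc[i, j - 1],      # advance in b only
                acc[i - 1, j - 1],  # advance in both
            )
    return float(acc[n, m])

fast = np.array([[0.0], [1.0], [2.0]])
slow = np.array([[0.0], [0.0], [1.0], [2.0], [2.0]])   # same shape, stretched in time
d_same_shape = dtw_distance(fast, slow)
```

Because the warping path may repeat frames, a slower imitation of the same contour can still align at zero cost, which is why DTW suits imitations whose timing differs from the original sound.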

14 Distance Calculation
K-L divergence: dissimilarity in probability distribution
DTW distance: dissimilarity in the temporal domain
$$D = \frac{D_{KL}}{\max(D_{KL})} + \frac{D_{DTW}}{\max(D_{DTW})}$$
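A minimal sketch of combining the two distances and ranking the candidates is shown below. It assumes the two normalised terms are summed and that each maximum is taken over all candidates for a single query; neither detail is spelled out in the slide text. The distance values are made up.

```python
import numpy as np

def combined_distance(d_kl, d_dtw):
    """Scale each distance by its maximum over the candidates, then sum."""
    d_kl = np.asarray(d_kl, dtype=float)
    d_dtw = np.asarray(d_dtw, dtype=float)
    return d_kl / d_kl.max() + d_dtw / d_dtw.max()

def rank_candidates(d_kl, d_dtw):
    """Candidate indices ordered from most to least similar to the query."""
    return np.argsort(combined_distance(d_kl, d_dtw))

# Hypothetical distances from one query to three sound candidates.
d_kl = [0.2, 0.4, 0.1]
d_dtw = [30.0, 10.0, 20.0]
order = rank_candidates(d_kl, d_dtw)
```

Dividing by the maximum puts both measures on a 0 to 1 scale, so neither dominates the sum just because of its units.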

15 Sound Retrieval Example
[Figure: scatter plot of sound candidates with axes K-L divergence and DTW distance. Labeled points: Triangle, Orchestra bells, Windgong, Trumpet, Tambourine, Viola (bowed), Vibraphone (sustained), Viola (plucked), Vibraphone (bowed), Woodblock, Tuba, Violin (bowed), Thaigong, Piano, Tambourine, Oboe (shake roll), Marimba hit with a rubber stick, Xylophone, Violin (plucked), Trombone.]

16 Dataset
Table 1. VocalSketch Data Set v1.0.4 [1]
Category | Sound Concepts (#) | Examples
Acoustic Instruments | Orchestral instruments playing a C note (40) | Orchestra bells, Triangle
Everyday | Acoustic events in everyday life (120) | Knocking, Sheep
Commercial Synthesizers | Apple's Logic Pro (40) | Metaloid, Shimmer
Single Synthesizer | A single 15-parameter subtractive synthesizer playing a C note (40) | Subsynth_2217, Subsynth_8828
Each class has 10 vocal imitations on average.
[1] M. Cartwright and B. Pardo, "VocalSketch: Vocally imitating audio concepts," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems,

17 Evaluation Measure
Mean Reciprocal Rank (MRR)
$$MRR = \frac{1}{Q}\sum_{i=1}^{Q}\frac{1}{rank_i}$$
Q: number of queries in the experiment
rank_i: rank of the target sound in the returned sound list for the i-th query
0 <= MRR <= 1; the higher the better
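As a worked example of the measure, the following short sketch computes MRR from a list of target ranks. The function name and the ranks are made up for illustration.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over all queries; ranks start at 1."""
    ranks = list(ranks)
    if not ranks:
        raise ValueError("at least one query is required")
    return sum(1.0 / r for r in ranks) / len(ranks)

# Three hypothetical queries whose targets came back at ranks 1, 2 and 4.
mrr = mean_reciprocal_rank([1, 2, 4])   # (1 + 0.5 + 0.25) / 3
```

An MRR of 1 means every target was returned first; a target ranked low in the list contributes almost nothing to the average.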

18 Experimental Setup
1) Use vocal imitations of half of all the sound concepts to train the SAE
# hidden layers = 2
# neurons in the 1st hidden layer = 500
# neurons in the 2nd hidden layer = 100
2) Use the other half for the sound retrieval experiment within each category
# sound concepts in Acoustic Instruments = 20
# sound concepts in Commercial Synthesizers = 20
# sound concepts in Everyday = 60
# sound concepts in Single Synthesizer = 20

19 Comparison Method
Hand-crafted features: Mel-frequency cepstral coefficients (MFCC)
39-dimensional MFCC vectors, including:
13 MFCC coefficients
13 first-order derivatives
13 second-order derivatives
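The 39 dimensions come from stacking the 13 static coefficients with their first and second time derivatives. The sketch below assumes a 13 x frames MFCC matrix is already available and approximates the derivatives with simple finite differences via `np.gradient`; speech toolkits commonly use a regression window instead, so this is an illustration and not the baseline's exact recipe.

```python
import numpy as np

def add_deltas(mfcc):
    """Stack static MFCCs with first- and second-order time derivatives.

    mfcc has shape (13, n_frames) with n_frames >= 2; the result has
    shape (39, n_frames).
    """
    delta = np.gradient(mfcc, axis=1)     # first-order derivative over time
    delta2 = np.gradient(delta, axis=1)   # second-order derivative over time
    return np.vstack([mfcc, delta, delta2])

# Random values standing in for real MFCCs of a 50-frame recording.
mfcc = np.random.default_rng(0).normal(size=(13, 50))
features_39 = add_deltas(mfcc)
```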

20 [Figure: four bar-chart panels (Acoustic Instruments, Commercial Synthesizers, Everyday, Single Synthesizer) showing the MRR of the Proposed and Baseline systems for three distance measures: K-L & DTW, K-L, and DTW.]
# neurons in the 1st hidden layer: 500
# neurons in the 2nd hidden layer: 100

21 [Figure: four bar-chart panels (Acoustic Instruments, Commercial Synthesizers, Everyday, Single Synthesizer) showing the MRR of the Proposed and Baseline systems for three distance measures: K-L & DTW, K-L, and DTW.]
# neurons in the 1st hidden layer: 1000
# neurons in the 2nd hidden layer: 600

22 Conclusions & future work
Conclusions:
Proposed the first unsupervised sound query-by-vocal-imitation system, evaluated on a large dataset
Achieved significantly better results with automatic feature learning than with hand-crafted features
Future work:
Experiments on CNN and RNN
Vision:
Sound query by vocal imitation will be the trend

23 The End Thank you for your attention!

24 Supervised Query-by-Vocal-Imitation System
Assumptions:
Closed-set scenario
Training data exist for each concept
