Algorithms for NLP. Speech Signals. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley


Maximum Entropy Models

Improving on N-Grams? N-grams don't combine multiple sources of evidence well: P(construction | After the demolition was completed, the). Here "the" gives a syntactic constraint and "demolition" gives a semantic constraint, but it is unlikely the interaction between these two has been densely observed in this specific n-gram. We'd like a model that can be more statistically efficient.

Some Definitions. INPUTS: "close the ___". CANDIDATE SET: {door, table, …}. CANDIDATES: table. TRUE OUTPUTS: door. FEATURE VECTORS: x-1 = "the" ∧ y = "door"; x-1 = "the" ∧ y = "table"; "close" in x ∧ y = "door"; y occurs in x.

More Features, Less Interaction. x = "closing the", y = "doors". N-Grams: x-1 = the ∧ y = doors. Skips: x-2 = closing ∧ y = doors. Lemmas: x-2 = close ∧ y = door. Caching: y occurs in x.

Data: Feature Impact

Features            Train Perplexity   Test Perplexity
3-gram indicators   241                350
1-3 grams           126                172
1-3 grams + skips   101                164

Exponential Form. Weights w; features f(x, y); linear score w^T f(x, y); unnormalized probability exp(w^T f(x, y)); probability P(y | x, w) = exp(w^T f(x, y)) / Σ_y' exp(w^T f(x, y')).

Likelihood Objective. Model form: P(y | x, w) = exp(w^T f(x, y)) / Σ_y' exp(w^T f(x, y')). Log-likelihood of training data: L(w) = log Π_i P(y_i* | x_i, w) = Σ_i log [ exp(w^T f(x_i, y_i*)) / Σ_y' exp(w^T f(x_i, y')) ] = Σ_i ( w^T f(x_i, y_i*) − log Σ_y' exp(w^T f(x_i, y')) ).
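The log-likelihood above can be computed directly from the model form. A minimal sketch with a hypothetical two-candidate, two-feature example (the feature values and weights here are made up for illustration):

```python
import math

# Toy maxent LM for a fixed context x: two candidate words, two features.
# f[y] plays the role of f(x, y); w is the weight vector. Values are hypothetical.
f = {"door": [1.0, 1.0], "table": [1.0, 0.0]}
w = [0.5, 1.0]

def log_prob(y, w, f):
    """log P(y | x, w) = w^T f(x, y) - log sum_y' exp(w^T f(x, y'))."""
    score = lambda yy: sum(wi * fi for wi, fi in zip(w, f[yy]))
    log_z = math.log(sum(math.exp(score(yy)) for yy in f))
    return score(y) - log_z

# Contribution of one training example whose true output is "door":
ll = log_prob("door", w, f)
```

Summing `log_prob` over all training pairs (x_i, y_i*) gives L(w).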

Training

History of Training. 1990s: specialized methods (e.g. iterative scaling). 2000s: general-purpose methods (e.g. conjugate gradient). 2010s: online methods (e.g. stochastic gradient).

What Does LL Look Like? Example data: x x x y. Two outcomes, x and y, with one indicator feature for each. [figure: likelihood as a function of the weights]

Convex Optimization The maxent objective is an unconstrained convex problem One optimal value*, gradients point the way

Gradients. The gradient is the count of features under the target labels minus the expected count of features under the model's predicted label distribution: ∂L(w)/∂w = Σ_i f(x_i, y_i*) − Σ_i E_{P(y | x_i; w)}[f(x_i, y)].

Gradient Ascent. The maxent objective is an unconstrained optimization problem. Basic idea: move uphill from the current guess; gradient ascent/descent follows the gradient incrementally. At a local optimum, the derivative vector is zero. Will converge if step sizes are small enough, but not efficient. All we need is to be able to evaluate the function and its derivative.
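The "move uphill in small steps" idea can be sketched in a few lines. This uses a hypothetical concave toy objective rather than the maxent log-likelihood itself, just to show the update rule:

```python
# Gradient ascent sketch: maximize g(w) = -(w - 3)^2, a concave toy objective
# standing in for the log-likelihood. Its gradient is -2(w - 3).
def grad(w):
    return -2.0 * (w - 3.0)

w = 0.0        # current guess
step = 0.1     # small enough step size to converge on this objective
for _ in range(200):
    w += step * grad(w)   # move uphill from the current guess

# At the optimum (w = 3) the derivative is (near) zero.
```

With a much larger step size the same loop would oscillate or diverge, which is the convergence caveat on the slide.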

(Quasi-)Newton Methods. 2nd-order methods repeatedly create a quadratic approximation and solve it. E.g. L-BFGS, which tracks derivatives to approximate the (inverse) Hessian.

Regularization

Regularization Methods. Early stopping. L2: L(w) − λ‖w‖_2^2. L1: L(w) − λ‖w‖_1.
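The different effects of the two penalties can be seen in their update rules. A sketch with hypothetical weights and a hypothetical λ, using a closed-form L2 shrinkage step and the L1 soft-threshold (proximal) step:

```python
# Regularization sketch (hypothetical weights and lambda):
# L2 shrinks all weights toward zero but leaves them non-zero;
# the L1 proximal (soft-threshold) step drives small weights exactly to zero.
lam = 0.5
w = [2.0, 0.3, -0.1]

# Closed-form L2 shrinkage step for a penalty of (lam/2) * ||w||_2^2:
w_l2 = [wi / (1.0 + lam) for wi in w]

# L1 soft-thresholding: subtract lam from the magnitude, clamp at zero:
w_l1 = [max(abs(wi) - lam, 0.0) * (1 if wi > 0 else -1) for wi in w]
```

Here the two small weights survive L2 (shrunk) but are zeroed by L1, which is the sparsity effect noted on the next slide.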

Regularization Effects. Early stopping: don't do this. L2: weights stay small but non-zero. L1: many weights driven to zero; good for sparsity, but usually bad for accuracy in NLP.

Scaling

Why is Scaling Hard? Big normalization terms Lots of data points

Hierarchical Prediction. Hierarchical prediction / softmax [Mikolov et al., 2013]. Noise-Contrastive Estimation [Mnih, 2013]. Self-Normalization [Devlin, 2014].

Stochastic Gradient. View the gradient as an average over data points. Stochastic gradient: take a step after each example (or mini-batch). Substantial improvements exist, e.g. AdaGrad [Duchi, 2011].
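One such improvement, AdaGrad, scales each step by the accumulated squared gradients. A minimal one-dimensional sketch on a hypothetical quadratic objective (in practice the gradient would come from one example or mini-batch at a time):

```python
import math

# AdaGrad sketch: step sizes shrink with accumulated squared gradients.
# The objective here is a made-up quadratic, f(w) = (w - 2)^2.
def grad(w):
    return 2.0 * (w - 2.0)

w, eta, eps, hist = 0.0, 1.0, 1e-8, 0.0
for _ in range(500):
    g = grad(w)                          # stochastic in practice
    hist += g * g                        # accumulate squared gradient
    w -= eta * g / (math.sqrt(hist) + eps)   # per-coordinate adaptive step
```

The accumulated history `hist` grows per coordinate, so frequently-updated features get smaller steps, rarely-seen features larger ones.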

Log-linear Parameterization. Model form: P(y | x; w) = exp(w^T f(x, y)) / Σ_y' exp(w^T f(x, y')). Learn by following the gradient of the training log-likelihood: ∂L(w)/∂w = Σ_i f(x_i, y_i) − Σ_i E_{P(y | x_i; w)}[f(x_i, y)].

Mixed Interpolation. But can't we just interpolate P(w | most recent words), P(w | skip contexts), P(w | caching)? Yes, and people do (well, did). But additive combination tends to flatten distributions, not zero out candidates.
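The flattening effect is easy to see numerically. A sketch with two hypothetical component distributions over three candidate words:

```python
# Why additive interpolation flattens rather than zeroes out candidates:
# even when one component assigns a candidate zero probability, the mixture
# keeps it alive. All values here are made up for illustration.
p_ngram = {"door": 0.0, "table": 0.5, "chair": 0.5}   # n-gram model rules out "door"
p_cache = {"door": 0.6, "table": 0.2, "chair": 0.2}   # caching model favors "door"

lam = 0.5
p_mix = {y: lam * p_ngram[y] + (1 - lam) * p_cache[y] for y in p_ngram}
# "door" keeps substantial mixture probability despite the zero component.
```

A multiplicative (log-linear) combination would instead let one component's hard zero veto a candidate, which is part of the motivation for the maxent form.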

Neural LMs

Neural LMs. [figure from Bengio et al., 2003]

Neural vs. Maxent. Maxent LM: P(y | x; w) ∝ exp(w^T f(x, y)). Simple neural LM: P(y | x; A, B, C) ∝ exp(A_y^T g(B^T C_x)), where g is a nonlinearity, e.g. tanh.

Neural LM Example. Context: x-2 = closing, x-1 = the. The context word vectors C_closing and C_the are mapped to a hidden vector h = B^T C_x, and P(y | x) ∝ e^(A_y^T h) for candidate words with output vectors A_man, A_door, A_doors. [figure: example numeric values for the vectors]
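The forward pass of this simple neural LM can be sketched end to end. The dimensions and all vector/matrix values below are hypothetical, chosen only to make the computation concrete:

```python
import math

# Simple neural LM forward pass: h = tanh(B^T c), score(y) = A_y^T h,
# P(y | x) = softmax over the candidate set. All numbers are illustrative.
def matvec(M, v):
    return [sum(mi * vi for mi, vi in zip(row, v)) for row in M]

c = [1.2, 7.4, 3.3, 1.1]             # concatenated context embeddings (closing, the)
B = [[0.1, -0.2, 0.05, 0.3],         # projection to a 2-d hidden layer
     [0.2,  0.1, -0.1, 0.0]]
A = {"man": [0.1, 0.5], "door": [2.3, -0.2], "doors": [1.5, 0.7]}

h = [math.tanh(z) for z in matvec(B, c)]                       # nonlinearity
scores = {y: sum(ai * hi for ai, hi in zip(A[y], h)) for y in A}
z = sum(math.exp(s) for s in scores.values())                  # normalizer
p = {y: math.exp(s) / z for y, s in scores.items()}
```

Structurally this is still the exponential form of the maxent LM; the difference is that the "features" A_y^T tanh(B^T C_x) are themselves learned.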

Decision Trees / Forests. [figure: decision tree with node tests like "last verb?", "prev word?"] Decision trees: good for non-linear decision problems. Random forests can improve further [Xu and Jelinek, 2004]. Paths to leaves basically learn conjunctions. General contrast between DTs and linear models.

Speech Signals

Speech in a Slide. Frequency gives pitch; amplitude gives volume. [figure: waveform and spectrogram of "speech lab", segmented into the phones s p ee ch l a b] Frequencies at each time slice are processed into observation vectors: … x12 x13 x12 x14 x14 …

Articulation

Articulatory System. Sagittal section of the vocal tract (Techmer 1880), with labels: nasal cavity, oral cavity, pharynx, vocal folds (in the larynx), trachea, lungs. Text from Ohala, Sept 2001, from Sharon Rose slide.

Space of Phonemes Standard international phonetic alphabet (IPA) chart of consonants

Place

Places of Articulation dental labial alveolar post-alveolar/palatal velar uvular pharyngeal laryngeal/glottal Figure thanks to Jennifer Venditti

Labial place bilabial labiodental Bilabial: p, b, m Labiodental: f, v Figure thanks to Jennifer Venditti

Coronal place dental alveolar post-alveolar/palatal Dental: th/dh Alveolar: t/d/s/z/l/n Post: sh/zh/y Figure thanks to Jennifer Venditti

Dorsal Place Velar: k/g/ng velar uvular pharyngeal Figure thanks to Jennifer Venditti

Space of Phonemes Standard international phonetic alphabet (IPA) chart of consonants

Manner

Manner of Articulation In addition to varying by place, sounds vary by manner Stop: complete closure of articulators, no air escapes via mouth Oral stop: palate is raised (p, t, k, b, d, g) Nasal stop: oral closure, but palate is lowered (m, n, ng) Fricatives: substantial closure, turbulent: (f, v, s, z) Approximants: slight closure, sonorant: (l, r, w) Vowels: no closure, sonorant: (i, e, a)

Space of Phonemes Standard international phonetic alphabet (IPA) chart of consonants

Vowels

Vowel Space

Acoustics

She just had a baby. What can we learn from a wavefile? No gaps between words (!). Vowels are voiced, long, loud. Length in time = length in space in the waveform picture. Voicing: regular peaks in amplitude. When stops are closed: no peaks, silence. Peaks = voicing: 0.46 to 0.58 (vowel [iy]), 0.65 to 0.74 (vowel [ax]), and so on. Silence of stop closure: 1.06 to 1.08 for the first [b], 1.26 to 1.28 for the second [b]. Fricatives like [sh]: intense irregular pattern; see 0.33 to 0.46.

Time-Domain Information. [figure: waveforms of "pat", "pad", "bad", "spat"; example from Ladefoged]

Simple Periodic Waves of Sound. [figure: sine wave over 0 to 0.02 s] Y axis: amplitude = amount of air pressure at that point in time; zero is normal air pressure, negative is rarefaction. X axis: time. Frequency = number of cycles per second: 20 cycles in 0.02 seconds = 1000 cycles/second = 1000 Hz.
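The frequency calculation from this slide is just cycles divided by duration:

```python
# Frequency = number of cycles per second, using the slide's values:
# 20 cycles observed in a 0.02 s window.
cycles = 20
duration_s = 0.02
freq_hz = cycles / duration_s   # 20 / 0.02 = 1000 cycles/second
```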

Complex Waves: 100 Hz + 1000 Hz. [figure: sum of a 100 Hz and a 1000 Hz wave over 0 to 0.05 s]

Spectrum. Frequency components (100 and 1000 Hz) on the x-axis, amplitude on the y-axis. [figure: spectrum with peaks at 100 Hz and 1000 Hz]

Part of the [ae] waveform from "had". Note the complex wave repeating nine times in the figure, plus smaller waves which repeat 4 times for every large pattern. The large wave has a frequency of 250 Hz (9 times in 0.036 seconds); the small wave is roughly 4 times this, or roughly 1000 Hz; two little tiny waves sit on top of each peak of the 1000 Hz waves.

Spectrum of an Actual Soundwave. [figure: spectrum from 0 to 5000 Hz]

Back to Spectra. A spectrum represents these frequency components. It is computed by the Fourier transform, an algorithm which separates out each frequency component of a wave. The x-axis shows frequency, the y-axis shows magnitude (in decibels, a log measure of amplitude). Peaks at 930 Hz, 1860 Hz, and 3020 Hz.
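A sketch of how a Fourier transform separates components, using a pure-Python discrete Fourier transform on the 100 Hz + 1000 Hz wave from the earlier slide (in practice one would use an FFT library; the sampling rate and lengths here are arbitrary choices):

```python
import cmath, math

# Sample 0.1 s of the 100 Hz + 1000 Hz complex wave at a hypothetical 8 kHz rate.
rate = 8000                              # samples per second
n = 800                                  # number of samples
wave = [math.sin(2 * math.pi * 100 * t / rate) +
        0.5 * math.sin(2 * math.pi * 1000 * t / rate) for t in range(n)]

def dft_mag(x, k):
    """Magnitude of the k-th DFT coefficient, i.e. frequency k * rate / len(x) Hz."""
    return abs(sum(xi * cmath.exp(-2j * math.pi * k * i / len(x))
                   for i, xi in enumerate(x)))

# Bin spacing is rate/n = 10 Hz, so 100 Hz is bin 10 and 1000 Hz is bin 100.
mag_100 = dft_mag(wave, 10)    # large: the wave has a 100 Hz component
mag_1000 = dft_mag(wave, 100)  # large, but half the amplitude
mag_500 = dft_mag(wave, 50)    # near zero: no 500 Hz component
```

The magnitudes peak exactly at the component frequencies, which is what the spectrum plots on these slides show.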

Source / Channel

Why these Peaks? Articulation process: the vocal cord vibrations create harmonics; the mouth is an amplifier; depending on the shape of the mouth, some harmonics are amplified more than others.

Vowel [i] at increasing pitches: F#2, A2, C3, F#3, A3, C4 (middle C), A4. Figures from Ratree Wayland.

Resonances of the Vocal Tract. The human vocal tract as an open tube: closed at one end (the glottis), open at the other (the lips), length 17.5 cm. Air in a tube of a given length will tend to vibrate at the resonance frequency of the tube. Constraint: the pressure differential should be maximal at the (closed) glottal end and minimal at the (open) lip end. Figure from W. Barry.

[figure from Sundberg]

Computing the 3 Formants of Schwa. Let the length of the tube be L. F1 = c/λ1 = c/(4L) = 35,000/(4·17.5) = 500 Hz. F2 = c/λ2 = c/((4/3)L) = 3c/(4L) = 3·35,000/(4·17.5) = 1500 Hz. F3 = c/λ3 = c/((4/5)L) = 5c/(4L) = 5·35,000/(4·17.5) = 2500 Hz. So we expect a neutral vowel to have 3 resonances, at 500, 1500, and 2500 Hz. These vowel resonances are called formants.
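The three formant values follow from one formula: a tube closed at one end resonates at odd quarter-wavelengths, F_k = (2k − 1)·c/(4L). The slide's arithmetic as code:

```python
# Resonances of a uniform tube closed at one end (the schwa approximation):
# F_k = (2k - 1) * c / (4L), using the slide's values.
c = 35000.0   # speed of sound, cm/s
L = 17.5      # vocal tract length, cm

formants = [(2 * k - 1) * c / (4 * L) for k in (1, 2, 3)]
# -> the expected formants of a neutral vowel: 500, 1500, 2500 Hz
```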

[figure from Mark Liberman]

Seeing Formants: the Spectrogram

Vowel Space

Spectrograms

How to Read Spectrograms. [bab]: closure of the lips lowers all formants, so there is a rapid increase in all formants at the beginning of "bab". [dad]: the first formant increases, but F2 and F3 fall slightly. [gag]: F2 and F3 come together; this is a characteristic of velars, and formant transitions take longer in velars than in alveolars or labials. From Ladefoged, A Course in Phonetics.

She came back and started again. 1. lots of high-frequency energy; 3. closure for [k]; 4. burst of aspiration for [k]; 5. [ey] vowel, where the faint 1100 Hz formant is nasalization; 6. bilabial nasal; 7. short [b] closure, voicing barely visible; 8. [ae], note the upward transitions after the bilabial stop at the beginning; 9. note F2 and F3 coming together for "k". From Ladefoged, A Course in Phonetics.

Deriving Schwa. Reminder of basic facts about sound waves: f = c/λ, where c = speed of sound (approx. 35,000 cm/sec). A sound with λ = 10 meters: f = 35 Hz (35,000/1,000). A sound with λ = 2 centimeters: f = 17,500 Hz (35,000/2).
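The two wavelength examples on this slide, as a quick check of f = c/λ (wavelengths converted to centimeters to match c):

```python
# f = c / wavelength, with the slide's values.
c = 35000.0            # speed of sound, approx. 35,000 cm/s

f_10m = c / 1000.0     # wavelength 10 m = 1,000 cm
f_2cm = c / 2.0        # wavelength 2 cm
```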

American English Vowel Space. [figure: vowel chart with HIGH-LOW and FRONT-BACK axes; vowels iy, ih, ix, ux, uw, uh, eh, ax, ah, ao, ae, aa] Figures from Jennifer Venditti, H. T. Bunnell.

Dialect Issues. Speech varies from dialect to dialect (examples: American vs. British English). Syntactic ("I could" vs. "I could do"); lexical ("elevator" vs. "lift"); phonological; phonetic. [figure: American vs. British pronunciations of "all" and "old"] Mismatch between training and testing dialects can cause a large increase in error rate.